The Platform Evolution team at QNX / Blackberry had a special role in building the Blackberry 10 OS. It was responsible for integrating all 500+ system and application components into one working operating system and for making sure that the OS booted safely, securely and as quickly as possible. Besides this, it was responsible for many other components and processes, including the toolchain that produced the final image loaded onto the hardware. That toolchain was used both by developers and by the build team producing signed builds for manufacturing.
The team consisted of 12 people in 2 locations: 10 in one and 2 in another 400 km away. In terms of seniority the team was well balanced, with one principal, 2 senior, and many regular and junior developers. However, it was not really a team. It was more a group of developers working on similar problems, running from one fire to another and trying to please everyone, while being frustrated that highly creative and capable people had to do repetitive work that required no creativity whatsoever.
The team's areas of responsibility:
- Toolchain for the BB10 image creation
- System integration
- Tooling for the configuration
- Boot mechanism and sequence
- File and service level permissions
- Security wipe
- Investigative work, root cause analysis
- OS upgrade over the air (originally was responsible for it)
- BB10 boot screen (originally was responsible for it)
One person, a team lead and agile coach, was added to the team to reorganize the team and its relationships and to bring back the creative work that was badly needed to keep the team together and improve the system integration.
Timeframe of the agile transformation covered by this article: 15 months
Phase 0 – Observation
It was obvious within the first days that the team had no boundaries and that any other team could assign work to it without prior agreement. The team had a backlog of 500+ items with mixed priorities and many old requests that were no longer even relevant. The team worked on whatever was asked of it by email or verbally. The louder the request, the sooner it was completed.
The tasks were simple: system configuration and file permission related jobs. Yet due to the complexity of the tooling and its dependencies, it was difficult to know what to change and then how to test and validate it. Matters were further complicated by the fact that building an image could take around 60-75 minutes, depending on the computer. Still, everyone was expected to validate their changes by building an image before checking code into the repository, especially if the change required system configuration changes.
Due to these factors, quite a few of the 2 daily official builds were unusable because they did not boot. When an image did not boot, the Platform Evolution team was expected to investigate and send the issue to the team that had caused it. The investigation required understanding the very complex boot sequence, so many teams had no idea how to debug it, and even when the issue was sent to them with an explanation they did not completely understand how they had broken the boot process.
Every broken build affected thousands of developers, which caused the company massive losses.
Besides system integration and configuration, the team was responsible for other tools and processes that nobody paid attention to for lack of time.
Phase 1 – Stability, creating order
- The first step was to create some order in the completely ad-hoc environment.
- The backlog was groomed, at least the high priority items, and the team started to push back any request that did not meet the newly established ticket criteria. The team politely asked for more information, and if it was not received within 24 hours the ticket was reassigned to the requestor’s team. This stopped the team’s backlog from being used as a dumping ground for tickets that nobody understood.
- Basic rules were set that ensured higher quality and shortened wait cycles. Each commit request required at least two approvals. To provide these efficiently, shifts were established in which each team member was expected to focus on reviewing commit requests during certain periods. This cut down wait time significantly and removed the need to nag others to look at reviews.
- A new rotating role was also established that monitored incoming requests to keep the backlog clean.
- Kanban was introduced to help the team focus on the work instead of trying to figure out what needed to be done.
These steps helped the team to bond and feel hopeful that the situation they were in would not last forever. Slowly they felt that they were somewhat in control of what they were doing. This was the time when the team decided to create their own logo and hang it up in the open space environment, so everyone could see who they were and where to request help if needed.
- What is the team’s purpose for existing?
- Is the team doing what it is really supposed to be doing?
- Drawing boundaries
- Based on the team’s purpose, which requests should be handled by the team and which by others?
- Which roles need to be created or dropped to help hold the boundaries?
- Inner discipline
Phase 2 – Creating time
After stabilizing the team and the workflow, the next step was to create some free time for the team to be able to do meaningful work that shortened cycles and simplified tooling.
- Managers, up to the VP level, were contacted and told why the work needed to be organized differently than before. There was strong pushback from some teams that did not want the responsibility of integrating their components into the system; however, VP-level encouragement and the team’s openness to help finally convinced them to learn and work together.
- A 2-day training was held to which other teams could send a candidate to learn how to work with the tooling and understand the system configuration.
- In roughly 3 months most of the simpler tasks were pushed back to the requesters with details on how to do the task and how to validate it. The team made sure that they were seen as cooperative partners and not as people who just did not want to work.
In the time that was freed up, several major innovations happened:
- The team built the foundation of a new build toolchain based on Python rather than the original shell-based scripts, which were only available on QNX, Linux and OSX. The first version could only compile the system configuration so that it was ready for the final image build; however, it did so in 2 minutes instead of 10-15. Two or three iterations later it already exceeded the capabilities of the original system through its ability to cross-check configurations. On top of this, installation was much simpler because Python was pushed into the common tools setup system, so anybody in the company could use it, even on a Windows computer, without any setup difficulties.
- The team switched from Subversion to Git and separated the system configuration source code from the output of the compilation. Until this point, people had to check in both the source of the configuration and the compiled configuration as one commit into the same repository, which often caused repository conflicts. Reviewing changes also became much easier because people only needed to look at the source code and no longer needed to check both the source and the compiled configuration.
- A build system was set up with Git that was triggered on every pull request and ran the Python compiler before the change even got into the review pool. This again shortened the time the team needed to spend on reviewing, because incorrect configuration changes were automatically pushed back to the requestor.
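The article does not describe the configuration format or the checks the compiler performed, so the following is only a minimal sketch of what cross-checking configurations before review could look like. The `Component` structure, its fields and the two checks (unknown service dependencies and conflicting file permissions) are assumptions made for illustration, not the team's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical per-component configuration entry (not the real BB10 format)."""
    name: str
    provides: set[str] = field(default_factory=set)       # services this component starts
    requires: set[str] = field(default_factory=set)       # services it needs at boot
    files: dict[str, str] = field(default_factory=dict)   # path -> permission string


def cross_check(components: list[Component]) -> list[str]:
    """Return human-readable errors found across all component configurations."""
    errors: list[str] = []
    provided = {svc for c in components for svc in c.provides}
    owners: dict[str, tuple[str, str]] = {}  # path -> (first component, its mode)

    for c in components:
        # A required service that no component provides would break the boot.
        for svc in sorted(c.requires - provided):
            errors.append(f"{c.name}: requires unknown service '{svc}'")
        # Two components must not assign different permissions to the same path.
        for path, mode in sorted(c.files.items()):
            if path in owners and owners[path][1] != mode:
                other, other_mode = owners[path]
                errors.append(
                    f"{c.name}: {path} mode {mode} conflicts with {other} ({other_mode})"
                )
            else:
                owners.setdefault(path, (c.name, mode))
    return errors
```

A pull-request gate in this spirit would run `cross_check` on the merged configuration and reject the change whenever the returned list is not empty, so reviewers only see changes that already pass.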
By this time, the team’s self-confidence had grown thanks to its successes and the much more creative work it was doing. Management also noticed the results, which affected the whole software developer ecosystem in the company, and gave the green light for further advancements.
- Agile mindset
- Periodically stop, think and sense
- What is important?
- Where does the team or other people lose time?
- With a bit of investment how can people be freed from repetitive work?
- What can the team do to help others be more responsible for their work?
- Feedback culture
- When something happens don’t just fix it, also think about how it can be prevented in the future.
- Stop sometimes and appreciate what the team has achieved.
- Give each other feedback about what went well and what didn’t. Learn from it by implementing change instantaneously.
- Sweep nothing under the rug. Speak your mind even if it is difficult; embrace and harness “conflict”, because it helps everyone to grow.
- How can the team automatically measure indicators that give immediate feedback and allow an immediate course correction?
- What is it that the team does not yet see but indicators could point to?
- How many things can the team handle without losing too much time to context switching?
- Agile practices
- SCRUM & Kanban
- Meeting types
- Iterative development
- Continuous integration
- Switching from Bugzilla to JIRA
Phase 3 – Creative Innovation
By this time, 90% of the system configuration work that the team used to do was being completed by the teams that had originally requested it. They took complete responsibility for the integration of their own components. Besides creative innovation, the team only provided consultation when complex multi-component issues needed to be solved, and it kept extending the configuration system to handle new security requirements.
The team took on many other projects that had been neglected before and solved problems that were not originally its responsibility. Management was kept informed at all times about the improvements the team was planning and completing, and many times the team was one step ahead. For example, John Chen, the new CEO, asked for the user experience of the boot process to be improved by adding messages to it, so that people would know what was happening and the company could communicate its added value. This happened exactly when the team completed a 4-month side project to show text during boot. John Chen personally provided the messages, which were later checked by Marketing and integrated into the system by the team.
- Agile mindset
- Innovation & Prototyping
- If the team is not ashamed of the released prototype and it has no rough edges, it was released too late to get the first feedback.
- What are the conditions for a prototype?
- How could they surprise others and even themselves with the next prototype?
- Incremental approach
- How can the team break down the full product into small and usable bits that immediately give experience to the users and then later on build on top of that?
- When is the right time to re-architect?
- When is the right time to change technology?
- Iterative process
- How can the team release a piece of usable software every 2-4 weeks to get feedback as soon as possible?
- Team cohesion
- Where is the point when the team members must take off the mask and show up as humans too in order to deeply bond as one team?
- Interpersonal skills
- How do we need to upgrade our communication to be able to include everyone in the process?
- What are the things the team can control, what can only be influenced, and what are the different strategies for each?
The team extended the Python system configuration compiler project and over 3 months created a new build system that built an image on an average computer, even on Windows, in 15 minutes, allowing anybody to fully test their software before checking in their code. This had massive implications for quality, since people could no longer blame the slowness or unavailability of the build system for their mistakes. The freed-up build machine capacity was also used to extend continuous integration to many other teams.
The improvements were done in several phases:
- Removing the dependency on a QNX virtual machine. Build time dropped to 60% of the original and toolchain setup became significantly simpler. This version was released immediately to all parties.
- Experimental version that still relied on the original tooling and partially on the modified bash scripts. New builds were set up to compare both performance and results.
- First version of the new system that did not use the old bash scripts. It only worked on OSX and was not yet capable of signing the builds. Build time was 10% of the original. Selected developers who did not need signed builds used the system daily to test and to save time.
- First version of the build system that all developers and testers started to use widely. It worked on all operating systems.
- Final version that included signing as well. All teams switched, including developers, build, testing and manufacturing.
A major re-architecture of the security wipe brought the duration of the wipe down from an extreme 2-3 hours in some cases to the 10-20 minute range.
- Defragmenting the non-volatile memory before the wipe to speed up the process by minimizing the number of lengthy secure memory block wipe cycles. Each wipe takes a similar amount of time regardless of the size of the block being erased.
- Improving the user experience of the wipe by first adding a throbber instead of a static screen and later replacing it with a real progress circle with translated text messages that are correlated to the actual steps of the process.
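The reasoning behind defragmenting first can be illustrated with a toy model. The sketch below assumes, as a simplification of the statement that each wipe takes a similar amount of time regardless of block size, that one wipe cycle can cover any contiguous run of used memory. The boolean memory layout and both function names are hypothetical and say nothing about the real flash hardware.

```python
def wipe_cycles(used: list[bool]) -> int:
    """Count contiguous runs of used units; in this model each run costs one wipe cycle."""
    cycles = 0
    previous = False
    for unit in used:
        if unit and not previous:  # a new run of used memory starts here
            cycles += 1
        previous = unit
    return cycles


def defragment(used: list[bool]) -> list[bool]:
    """Pack all used units to the front, keeping the amount of used memory unchanged."""
    count = sum(used)
    return [True] * count + [False] * (len(used) - count)
```

In this model a layout with three scattered used regions needs three cycles, while the same data packed into one region needs a single cycle. Since the cost per cycle is roughly constant, total wipe time scales with the number of cycles rather than with the amount of data.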
The improvements were done in several steps:
- Adding a throbber instead of a static image
- Introducing defragmentation before secure wipe
- Introducing highly secure security wipe for enterprise environment
- Adding real time progress indicator with messages
User experience of boot process
- Improving the user experience of the boot process by adding translated text to each phase showing what the phone is doing during boot
- Introducing Boot Error indicators, where users can find information about what they need to do if the system cannot boot up for some reason
OS upgrade over the air
- One of the major challenges BB users experienced was failed OS upgrades. Three components needed to be upgraded at the same time: the read-only OS, the radio partitions and the writable user partition. The system was designed to roll back either of the read-only partitions when the new system was unable to boot after 3 attempts; however, it was not capable of rolling all of them back together, the way a database system rolls back a transaction and deletes all traces of it in case of failure. Because of that, any rolled-back system would be unusable due to mismatching partitions.
- The team took on this challenge, although it was not its responsibility, and solved it in 4 months in collaboration with 5 other teams, all of which needed to contribute because the solution depended on new features. One example is the file system snapshot mode, in which all changes were written to disk during boot but were not committed to the directory tree, so if the boot failed the writable partition had no changes committed to it.
- The first release that included the upgraded over-the-air update system had a 0% failure rate for the mismatching-partition rollback problem, whereas before that the number had hovered around 1-5%. Some phones still rolled back due to some error; however, they remained usable, just running the original software version.
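The transactional behaviour described above can be sketched as follows. This is a simplified model, not the actual update mechanism: the partition names, the `upgrade` function and the `boots_ok` callback are assumptions for illustration. Only the 3 boot attempts and the all-or-nothing outcome come from the text.

```python
from typing import Callable

PARTITIONS = ("os", "radio", "user")  # hypothetical names for the three components
MAX_BOOT_ATTEMPTS = 3                 # the text mentions rollback after 3 failed boots


def upgrade(
    installed: dict[str, str],
    new_version: str,
    boots_ok: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Return the partition versions after an all-or-nothing upgrade attempt."""
    # Stage the new version for every partition; nothing is committed yet,
    # similar in spirit to the snapshot mode of the writable partition.
    staged = {name: new_version for name in PARTITIONS}
    for _ in range(MAX_BOOT_ATTEMPTS):
        if boots_ok(staged):
            return staged  # commit every partition together
    return dict(installed)  # discard every staged change together
```

Whatever the outcome, all three partitions end up on the same version, which is the property the earlier design lacked when it rolled partitions back individually.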
By this time the team was deeply trusted by management thanks to the results it had been consistently delivering, and it was invited to participate in the secret project that later became known as the Blackberry Android device.
It is possible to carry out an agile transformation from the bottom up; however, management support is essential. Without the initial support from the VP level it would have been very difficult, if not impossible, to achieve the agile transformation.
It was observed during the transition that the trust between the team and the management grew stronger as the team started to perform on a higher level and create value that the management was not expecting or did not think was possible. Also the more the team sensed the trust from the management the more they allowed themselves to be even more creative by pushing the boundaries.
The relationship with other teams improved in many cases, and in some cases the team was an inspiration for others. In some cases, however, other teams saw the agile way of working as too much change too often and complained that they needed to learn new things again, even if they benefited from it in the long run. These situations required delicate human-to-human interactions, which called for skills that not all team members possessed. That’s where the team lead / agile coach needed to help the most.
The team learned the agile mindset in small bits and pieces by working through the issues, with the agile coach gently guiding them as a group and stretching everyone in one-on-one sessions.
Organizational developer, Agile coach, Change leader