This was extracted (@ 2024-05-15 21:10) from a list of minutes which have been approved by the Board.
Please note: the Board typically approves the minutes of the previous meeting at the beginning of every Board meeting; therefore, the list below does not normally contain details from the minutes of the most recent Board meeting.

WARNING: these pages may omit some original contents of the minutes.
This is due to changes in the layout of the source minutes over the years. Fixes are being worked on.

Meeting times vary; the exact schedule is available to ASF Members and Officers (search for "calendar" in the Foundation's private index page, svn:foundation/private-index.html).


17 Apr 2024 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Infra Team worked with Security and M&P on the xz vulnerability
 (CVE-2024-3094). We completed assessments and notifications to
 potentially-vulnerable projects/PMCs within 24 hours of receiving
 notice of the CVE.

- The Infra team is ready to begin the budgeting process.

Short Term Priorities
- Budget.

Long Range Priorities
- Artifact Distribution Platform.
- Atlassian Cloud

General Activity
- Working on a GitHub Actions policy for PMCs' usage of GHA, in order
 to properly share our limited resources/limits. This also includes
 some work on a scanner, looking for improper uses of GHA. Infra is
 also working on a usage dashboard.
- Deprecated paste.a.o, sent notification, and will turn off at the
 end of April.
- Continued work on asfquart, an Infra-written/supported set of
 functions for Quart-based apps (eg. Agenda, ADP, selfserve, etc)
- Upgraded a bunch of boxes from 18.04 to 22.04 in order to get mod_md
 to use LetsEncrypt for server certificates (instead of our wildcard).
- Preparing for our next Infra Roundtable.
- Improving our usage of AWS, from a security perspective. This will
 be a revamp of roles and IAM.
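
The GitHub Actions item above concerns sharing a limited pool across
PMCs. A minimal sketch of the kind of check a usage dashboard might
perform is shown here. The record layout, the project names, and the
25% threshold are assumptions for illustration only; the report does
not describe the actual policy or scanner.

```python
from collections import defaultdict


def usage_by_project(runs):
    """Sum runner minutes per project from (project, minutes) records."""
    totals = defaultdict(float)
    for project, minutes in runs:
        totals[project] += minutes
    return dict(totals)


def over_fair_share(totals, max_share=0.25):
    """Return projects whose share of total usage exceeds max_share.

    max_share is a hypothetical threshold, not an ASF policy value.
    """
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(p for p, m in totals.items() if m / grand_total > max_share)


# Hypothetical usage records: (project, runner minutes).
runs = [("alpha", 600), ("beta", 100), ("gamma", 250), ("beta", 50)]
print(over_fair_share(usage_by_project(runs)))
```

A share-based threshold is only one possible design; absolute caps or
per-workflow limits would fit the same structure.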

20 Mar 2024 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Upcoming team meeting in Sacramento, California.

- We are tracking to our budget, and will start planning the FY25
 Infra budget shortly.

Long Range Priorities
- Artifact Distribution Platform
- Agenda Tool

General Activity
- Regular activity of upgrading boxes off Ubuntu 18.04 (EOL'd) onto
 22.04. This also allows us to move more boxes to LetsEncrypt for
 their TLS connections.
- We continue working on moving Jira and Confluence to the Atlassian
 cloud-based products. Greg met with our assigned Atlassian people at
 their office in Austin, Texas.
- More work on spam handling, DKIM, and DMARC.
- We have continued our Roundtable events each month, and are starting
 a monthly newsletter to reach out to our communities about Infra's
- Added some monitoring of GitHub Action usage, as that is a single
 pool used by everybody. Tragedy of the Commons, and all that.

21 Feb 2024 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- We started a monthly newsletter (to be called "Inside Infra") as
 part of our outreach to the broader community. The first issue is at

- We will soon begin budgeting for FY25

General Activity
- Lots of progress on "asfquart", a Python package to provide a lot of
 basic features for Infra's web-based applications. It is built upon
 the standard "Quart" package, bringing in our typical ways to
 configure authentication, configuration, and other ASF-isms.
- Additional work on the Agenda Tool, based on its use for the January
 meeting. It will again be used for the February board meeting.
- The integration of our LDAP/keycloak service with the Atlassian
 cloud services is now working. We now have some months of testing
 before switching to the cloud versions of Jira and Confluence.
- We are moving many servers over to LetsEncrypt for their specific
 hostname, rather than using our *.a.o certificate. This assists with
 maintenance, and reduces our threat landscape.
- Reducing custom svn accounts, in favor of LDAP service accounts.
- Lots of work on hardening our systems, ACLs, and 2FA.
- Successful February Roundtable.
- Moving the asf.yaml stuff into a more available repository for
 review, PRs, issues, and feedback.
- Some long work on Bugzilla to perform an upgrade. There are some
 mixed dependencies that made this difficult.

17 Jan 2024 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Whimsy was turned off due to an RCE report. A new server was set up
 and built out. It was turned off on December 31st, 2023, and service
 was restored January 4th. Credit goes to the hard work of one of the
 Whimsy volunteers, who will remain private from our public minutes.
- Short report due to holidays. Infra always had staff on-call, but
 most forward-looking projects were on hold.

Short Term Priorities
- Successful linkage of our LDAP to Atlassian Cloud products

Long Range Priorities
- Artifact distribution

General Activity
- The Agenda Tool proceeds. It was used for the December Board
 meeting, except for working through the Action Items. The intent is
 for it to be used by the Chair for the January meeting, too.
- Work continues with Atlassian, Okta, and a third-party consultant
 to hook up our LDAP as authoritative, using SAML to
 integrate. Several important hurdles were cleared.
- Slow work on the new svn master.
- Lots of work on the ant/maven/jdk matrix and deployments across our
 Jenkins clusters.
- We have some 18.04 boxes that are getting migrated to 22.04 LTS,
 allowing for the use of mod_md for easier certificate handling.
- Beginning planning for ApacheCon in Denver and Bratislava, along
 with a spring team meetup. Infra will be present at both ApacheCon
 events and provide some talks.
- Ran a community survey, asking people questions regarding Infra.
 We'll collate and publish.
- An Infra Roundtable was held January 10th.

20 Dec 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- An unexpected US$2700 cost for a contractor was needed, but not
 budgeted for FY24. There should be available funds for this as part
 of our Staffing costs.

Short Term Priorities
- Get the agenda tool into shape for running a Board meeting.
- Get ASF/Okta/Atlassian working together.

Long Range Priorities
- Artifact Distribution Platform
- Code Signing workflow

General Activity
- Lots of extensive work on moving the Agenda Tool forward. It _may_
 be ready to assist with running the December meeting, and certainly
 for January.
- Continued struggling with Okta integration, in order to use the
 Atlassian cloud products, synchronized with our LDAP database. We
 have engaged a third-party to work through these issues.
- Additional work on our Kopia proof-of-concept for backups.
- Implemented an SMS-to-email gateway. The primary motivation here is
 to assist the Treasurer office with SMS-based 2FA.
- Our oauth.a.o solution is now based on keycloak, rather than custom
 in-house code. This will enable 2FA during the OAuth workflow.
- Various Ansible work to manage a few boxes and processes. Infra
 does not have a specific demand for a move to Ansible, but we're
 using it for key tasks with an eye to future expansion.
- Working on Cassandra docker space/cleanup issues.
- Found/fixed a GitHub GraphQL issue that was holding up our account
 and group synchronization.
- With MarkT, movement on a new set of code signing services for our
 projects. Linux and Windows are easy, macOS is hard but it appears
 we have solutions for all platforms. Next up is to dev/test these
 new workflows. The result should open up code signing to ALL
 projects, in a simplified manner.
- Infra will stop auto-subscribing third-party email archives to our
 mailing lists, when they get created. We have been doing this for
 years, but it is improper to favor specific parties. It has also
 created minor issues with some of our email management.
- Continued work on setting up Azure/AWS VMs on demand. Spinning up
 new images and stabilizing the service.
- On December 6, Infra hosted a Roundtable that was well-attended and
 seemed to be very successful with the community.
- Upgrading a bunch of Ubuntu 18.04 boxes to 22.04 in order to use
 mod_md for LetsEncrypt certificates (rather than relying on our
 wildcard cert). The Bugzilla and Reviews boxes have presented some
 challenges to this process.

15 Nov 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Short Term Priorities
- Get Okta to work (concall Thursday 16th), to open migration to the
 Atlassian Cloud products.
- Begin work on ADP.

Long Range Priorities
- Artifact Distribution Platform to replace dist.a.o, svn:dist,
 archives.a.o, etc; improved workflow on releasing artifacts.
- More boxes on LetsEncrypt.

General Activity
- Work progresses on MFA processes/policy, which will be implemented
 using a self-hosted keycloak instance. Integration of this service
 with the third-party Okta service is in progress, in order to provide
 identity services for the Atlassian Cloud products. There have been
 numerous obstacles, but the team is working through them with some
 assistance from Atlassian and Okta (to varying degrees).
- Continued work on the Agenda Tool, including navigation, revised UX,
 and cookies/preferences.
- The team has begun work on "asfquart" as a utility package for our
 many Quart-based python apps: agenda, selfserve, IRD, ADP, idm. The
 package will encompass our "best practices" to share across the apps.
 Some coding has begun, using a review of the existing apps.
- This month, infra held a great roundtable about trust networks and
 signing artifacts, including future paths such as sigstore.
- Emergency upgrade of Confluence for a highly critical CVE.
- Set up a monthly cron for Data Privacy to review all TLP websites
 for use of non-approved traffic trackers. Matomo is the supported
 mechanism for traffic recording/analysis.
- Additional requirements gathering for the Artifact Distribution
 Platform (ADP). Development will begin soon.
- Some work on the mailer script for our Subversion installation. The
 existing mailer is py2, which is inhibiting our upgrade. The new
 script is py3-compatible but missing some features.
- Tweaks to our notification and escalation policies around alerts.
- Working with the Security Team to automatically direct the flow of
 GitHub security notifications received by the org admins, onwards to
 the relevant PMC's security or private mailing lists.
- One of our staff has been out for a couple weeks due to a medical
 issue, and is back!
- Transferred a bunch of domains into our shared account at
 Namecheap. They had been held privately with shared management. Our
 new use of BitWarden allows for TOTP-based MFA on the shared
 account, enabling this new setup. Much better continuity.
- Experimenting with a new backup/alternate solution.
- Lots of great upgrades to our Yahoo-donated nodes for Jenkins.
- Work has started on reducing the exposure of our wildcard
 certificate, in favor of broader use of LetsEncrypt across our
 boxes. The major trip is needing to upgrade some boxes from Ubuntu
 18.04 to $latest.
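
The monthly review of TLP websites for non-approved traffic trackers,
mentioned above, could be approximated along these lines. This is a
sketch under stated assumptions, not the script Infra runs: the
approved host name is a placeholder, and only externally loaded
script tags are inspected.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist of hosts that scripts may be loaded from.
APPROVED_HOSTS = {"analytics.example.org"}


class ScriptHostCollector(HTMLParser):
    """Collect hostnames of externally loaded <script src=...> tags."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            host = urlparse(src).hostname
            if host:  # relative URLs have no host and are same-site
                self.hosts.add(host)


def unapproved_script_hosts(html, approved=APPROVED_HOSTS):
    """Return script hosts found in html that are not on the allowlist."""
    collector = ScriptHostCollector()
    collector.feed(html)
    return sorted(collector.hosts - set(approved))


page = """
<html><head>
<script src="https://analytics.example.org/matomo.js"></script>
<script src="https://tracker.example.com/t.js"></script>
<script src="/js/site.js"></script>
</head></html>
"""
print(unapproved_script_hosts(page))
```

A real review would also need to consider inline scripts, tracking
pixels, and iframes, which this sketch ignores.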

18 Oct 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Our presence at Community Over Code was solid, with a track on
 Monday and a table throughout the conference. We got a lot of
 engagement from the wider community, and the team felt they heard
 and met with a lot of people and their concerns.

Short Term Priorities
- Figure out ASF/Okta identity
- Replace svn-mailer

Long Range Priorities
- Artifact Distribution Platform
- Migration to Jira Cloud and Confluence Cloud

General Activity
- Lots of work around DKIM, DMARC, and SPF to tighten up our email
 handling. Goal is to reduce spam and to improve trust in our MTA.
- Keycloak work is progressing to bring MFA to more ASF services.
- The team has been working on our track at the Community over Code
 conference in Halifax in October. We have accepted a few talks, and
 been working on adding our own talks. Lots of work over the past
 couple months went into preparing for the Infra track.
- The new Agenda Tool is progressing with improvements to the user
 experience and creation of a "cursor" to run through the agenda
 during a meeting.
- Held an Infrastructure Roundtable on August 2nd, which went very
 well, with lots of great feedback/interest.
- Our logs cluster has been upgraded to newer hardware, providing more
 processing and storage for the same price.
- Gradle wrap-up along with Fundraising/M&P planning for a press
 release and time in Halifax with Gradle, Inc. (note: product renamed
 to "Develocity")
- After much stoppage/frustration trying to solve our user
 authentication process for the migration to the Atlassian Cloud, we
 finally have some progress and are beginning the next steps. It is a
 complex mingling of keycloak and Okta, which (it seems) nobody has
 tried before.
- Fully migrated away from LastPass over to BitWarden.
- Stood up a new "Logo Server" for M&P (and other ecosystem members,
 eg. ComDev) to use in their press kit. This accepts SVG logos and
 produces multiple output formats, sizes, transparency, etc.
- Rebuilt new account creation workflow, on the Infra side. Process
 remains the same for Secretary.
- Wrapped another round of backup work.
- Unfortunately, slow work on an svn-mailer replacement has held up
 our migration to a Subversion server.
- Lots of work defining needs for the A.D.P, particularly with
 community feedback from the session in Halifax.
- Download stats have been moved from a Kibana dashboard to a much
 more functional page on infra-reports.a.o. Coming out of Halifax, we
 are also discussing gathering/displaying Slack communication stats. Goal
 is to look at overall community participation (including mailing
 stats from lists.a.o)
- Enabled ephemeral GitHub Action runners on Azure.
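
For readers unfamiliar with the DMARC part of the email work listed
above: a DMARC policy is published as a DNS TXT record made of
semicolon-separated tag=value pairs. The sketch below parses such a
record and checks whether it asks receivers to act on failures. The
sample record and its example.org address are placeholders, not the
Foundation's actual record.

```python
def parse_dmarc(record):
    """Parse a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


def is_enforcing(tags):
    """True when the policy asks receivers to quarantine or reject."""
    return tags.get("p", "none").lower() in {"quarantine", "reject"}


# Placeholder record for illustration.
sample = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.org"
tags = parse_dmarc(sample)
print(tags["p"], is_enforcing(tags))
```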

20 Sep 2023 [Myrle Krantz]

A report was expected, but not received

16 Aug 2023 [Myrle Krantz]

A report was expected, but not received

19 Jul 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

General Activity
- Forward movement of our Pelican website generation system. A number
 of outstanding pull requests were merged, and the underlying stack
 is being updated.
- We are assembling an "Infra track" for C>C this fall.
- Our PubSub system had a surprising defect in the payload size, which
 was throwing off many parts of our systems. Found and corrected.
- The Gradle Enterprise installation is effectively complete, and we
 will start rolling it out to more projects.
- Jira (re)activation is complete. The new workflow has greatly
 reduced spam accounts and spam tickets. There have been some
 escalation of initial rejections and other issues, but the process
 is becoming smoother and cleaner.
- The Agenda Tool is progressing, with a focus for assisting the Chair
 to run the meeting.
- Wrap-up of LastPass -> BitWarden migration.
- Improved mail handling and spam prevention.
- gitbox grew a "file diff" feature for Apache Tomcat, to backfill the
 turned-off gitweb that was disabled last month.
- Continued work for ephemeral build nodes, with a lot of focus on
 constructing AMI images and using Ansible to manage the process.

21 Jun 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- has been turned off, with active blogs moved to
 selfserve by the projects, and redirects to their new location.
 Blogs not migrated remain active, in readonly form.

General Activity
- Continued work on a keycloak proof-of-concept to provide an OIDC
 (and OAuth) endpoint to services. It will also provide MFA on top of
 our LDAP accounts.
- Moving along an ansible test for secrets management, as a test of
 Ansible rollout and replacement for EYAML in our Puppet configs.
- Jira account management has been revised towards a more self-serve
 approach, and to simplify the workflow for a PMC. This includes a
 new "reactivate" feature in case Infra deactivates an account for
 idle-ness, and the contributor returns.
- Ephemeral GitHub Action nodes in Azure are up and running. This will
 allow projects to choose specific node types for their GHA workflows
 (eg. ARM or large memory nodes).
- has been made readonly. Projects with active blogs
 have moved to a selfserve approach (via asf.yaml). For projects that
 have migrated, redirects are in place on blogs.a.o. Eventually, an
 aggregator will be deployed for all projects' blogs.
- Gradle moves along, and more projects are getting their builds
 looped into the service.
- Work on migrating BitWarden accounts, so that we can turn off
 LastPass when our contract ends in August.

17 May 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Agenda Tool (read-only mode) is available now for testing. See

- FY24 budget was submitted to President last month; a minor revision
 was made to reduce the Travel/Meetup request.

Short Term Priorities
- Get the logo server fully managed by puppet as an Infra service.

Long Range Priorities
- All the secrets, and all the MFA. Keycloak, BitWarden, and Vault
 fully-deployed to our users.

General Activity
- Ansible improvements for Jenkins ephemeral nodes. Some outlook for
 secrets management on our machines.
- Continued work on move to Atlassian Cloud for our Jira/Confluence
 products. A major hurdle is dropping our current user count within
 their current limits. An interesting creative anecdote: one of our
 teammates used ChatGPT to power-up and write some SQL and Python to
 extract data from the Jira tables. A great force-multiplier!
- LastPass -> Bitwarden migration continues. Infra has moved, and we
 are now assisting other departments/Officers with the move. The LP
 contract ends in approximately August, while BW is monthly for now.
- Further work to set up GitHub Action runners in Azure. There is a
 lot of "magic" to get permissions/roles configured for this, to
 enable more people in the community to get these set up.
- Gradle continues to advance towards a productive build/analysis. We
 continue to work with Gradle, Inc on integration of their product
 into our Jenkins clusters.
- Some minor cleanup for the LDAP lockdown, specifically related to
 two LDAP servers with mismatched rules and a couple scripts that
 were missed due to this split-brain.
- Keycloak is in a "Proof of Concept" mode, and advancing through MFA
 support and recovery workflow. This should bring MFA to more of our
 systems, including MFA for our OAuth-based services.
- Our slack bot, Qbot, is now completely managed by Puppet.
 Development work has continued, adding new features and reliability.
- Blog migration is moving slowly. It is likely that many of the older
 blogs will simply be archived, rather than moved onto a new
 platform. A few projects have already stated their intent to
 discontinue blogging.
- Improvements in our pubsub will allow us to replay missed events
 with better reliability.
- The "gitweb" service has been disabled on gitbox.a.o due to
 operational problems, and maintenance concerns. Some redirects have
 been added to point people towards for web-based browsing
 of the repositories.
- The Artifact Distribution Platform is seeing some movement, with
 input from the Security Team and the Apache Beam project.
- Smoothing out the Roundtable scheduling/announce processes.

19 Apr 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Infra held a meetup in Nashville, TN, USA. The team accomplished
 several goals, along with social/team building. This was the first
 Infra-only team event since 2019; we canceled our Spring 2020 event.
- Agenda Tool is ready for testing
- Anonymous LDAP access/read is finally disabled.

The Infrastructure FY24 budget has been provided to the President's
Office for roll-up and presentation to the Board.

Short Term Priorities
- Test the Agenda Tool, and move to phase 2 (read/write).

Long Range Priorities
- Build/deploy the Artifact Distribution Platform.

General Activity
- The INFRA workspace in Jira has been made committer-only for safety
 and security. While we do not rely upon "security via obscurity", it
 does not hurt to hide things.
- LDAP cutover to new system and required-authn/authz has succeeded
 very well. The team had expected to fast-fix issues, but only a
 couple of minor issues arose and were quickly resolved.
- Infra staff has moved to Bitwarden, and will roll this out to all
 (prior) LastPass users.
- Continued blogs.a.o migration.
- Much work has been done on our Gradle Enterprise instance/install.
 Some projects are involved in the trial/testing.
- Various small matters: Qbot, Twilio, PBX
- Experimenting with Hashicorp Vault for programmatic secrets management.
- Migrated to Bitwarden, as a LastPass replacement. This is also
 observed in the submitted budget.
- Continued work on Gradle implementation, and on Atlassian SaaS move.
- Received final notification / re-up on cloud credits.
- Ran a podling-focused Roundtable. We need to revise notification and
 announcements to improve attendance.

22 Mar 2023 [Myrle Krantz]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- FY24 Budget process has begun.

Short Term Priorities
- Make our team meetup a success.
- LDAP cutover to newer system and ACLs.

Long Range Priorities
- Switch to Bitwarden (for Infra and other ASF Officers)
- Launch the Artifacts Distribution Platform

General Activity
- New self-serve portal has launched. This is newer tech (similar to
 other server-based webapps Infra has), making it easier to maintain
 and to add future features.
- The new self-serve is also used for an improved workflow for people
 to request a Jira account. It is now clearer and less work to
 request an account, less work for PMC members to review and approve
 creation of the Jira account.
- Implemented new policy for GitHub PRs from outside contributors to
 require explicit approval before building/running the PR code using
 our GitHub Action allocation (think: uncontrolled usage; also some
 security concerns). Default is now that each PR must be approved,
 though some projects have opted for first-approval implies automatic
 future approval.
- The Infra team helped set up and run our Apache STeVe instance for
 the Annual Members Meeting. The process completed successfully, and
 a couple runbooks were updated for Chair and Infra actions.
- Blogs migration from Apache Roller to PMC-based choices continues.
- TravisCI has been turned off, and Travis, Inc has been notified we
 will not renew our contract. Most projects converted to GitHub
 Actions (the two build systems are very similar).
- Continued work on Gradle Enterprise and migration to the cloud
 versions of the Atlassian products.
- New subversion server stood up.
- Put together a request for cloud credits from one of our major
 sponsors. Should hear back later this week.
- Work continues on the Agenda Tool, which should be ready for testing
 and use for the April Board meeting.
- We plan to switch from LastPass to Bitwarden due to continuing
 issues with LP. We have solid references that BW is a better
 service. We have decided to use their hosted service rather than an
 on-premise installation.
- Preparation for the Infra Team Meetup in Nashville, TN from April
 13th to the 18th.
- Implementing an autopatch/reboot process for project VMs. Build
 nodes for Jenkins and Buildbot got their own automated system.
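
The PR approval policy described above (every outside PR approved by
default, with an opt-in where the first approval implies future ones)
can be expressed as a small decision function. This is an
illustrative model only; the class, field names, and mode strings are
hypothetical and do not correspond to GitHub or ASF configuration
keys.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    # Hypothetical modes: "every_pr" is the default described in the
    # report; "first_only" is the opt-in variant.
    approval_mode: str = "every_pr"
    committers: set = field(default_factory=set)
    previously_approved: set = field(default_factory=set)


def needs_approval(project, author):
    """Decide whether a PR from author must be approved before CI runs."""
    if author in project.committers:
        return False  # the policy targets outside contributors only
    if project.approval_mode == "first_only":
        return author not in project.previously_approved
    return True


strict = Project(committers={"alice"})
relaxed = Project(approval_mode="first_only", previously_approved={"bob"})
print(needs_approval(strict, "bob"), needs_approval(relaxed, "bob"))
```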

15 Feb 2023 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Our primary budgeting spreadsheet has recently been (mostly)
updated. A few more tweaks, and we'll be able to produce a new budget
in short order. Waiting for the President's Office to begin the
budgeting process.

Short Term Priorities
- Complete LDAP upgrade
- Roll the Agenda Tool out for testing

Long Range Priorities
- Artifacts Distribution Platform
- Atlassian Cloud applications
- Gradle for the communities

General Activity
- Continued work to provide blogging options to projects, other than
 Apache Roller. One project has migrated, and four more are tracking
 their blog migration.
- Many projects have migrated from TravisCI to GitHub Actions. Recall
 that TravisCI ended their Open Source program a couple years
 ago. The Foundation has filled in since, with a service contract.
- Some work has started on dynamically spinning up Jenkins builder
 nodes, in AWS and Azure.
- Work on upgrading the LDAP server continues. This has been
 re-planned as a phased migration, rather than one big change.
- The selfserve portal is being rewritten from old-school CGI scripts
 to a modern async Quart-based server. This will provide a better
 platform for future features, and simpler maintenance/development.
- Some work on moving to Atlassian cloud-based applications. Some work
 was held up awaiting some input from Atlassian. Work should resume
 in the next month.
- Improvements in Puppet support on Windows, along with improved Maven
 tooling and JDK support.
- The Agenda Tool is getting close to availability for testing.
- Additional work on the Artifacts Distribution Platform.
- Automated quarterly reboots of builder nodes, to pick up any
 platform (security) updates.
- Getting vote.a.o ready for the upcoming Annual Members Meeting.
- Ran a successful "Infra Roundtable" monthly video meeting with
 members of the communities.

18 Jan 2023 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Short Term Priorities
- Prepare for the March Annual Members Meeting, possibly with new irc
 bot and voting tech.
- LDAP constraints.
- Shift away from TravisCI

Long Range Priorities
- Artifact Distribution Platform

General Activity
- Continued progress on the new Infra-supported Agenda Tool. It is
 nearing v1.0 release of a readonly display. (v2 will be interactive)
- Lots of JDK work for consistency across the build nodes.
- Implementation of Gradle Enterprise. Several projects are currently
 testing our deployment.
- Some early testing of Ansible, for future config management.
- MFA work and planning.
- Started work on Hugo for projects to use for website builds.
- Roundtable held on January 11th.
- Artifact distribution work continues, with some help from the
 Security Team for design requirements.
- Auto-reboot script looking great for our build nodes, and getting
 manual for manual reboots of other nodes. We're now tracking a reboot
 per quarter for every box to pick up (kernel) security upgrades.
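
Tracking "a reboot per quarter for every box", as described above,
amounts to comparing each host's last reboot date against a cutoff. A
minimal sketch follows, assuming a 90-day window and a simple
host-to-date mapping; both are assumptions for illustration, and the
host names are invented.

```python
from datetime import date, timedelta


def reboot_overdue(last_reboots, today, max_age_days=90):
    """Return hosts whose last recorded reboot is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        host for host, rebooted in last_reboots.items() if rebooted < cutoff
    )


# Hypothetical inventory: host name -> date of last reboot.
inventory = {
    "build-node-1": date(2022, 10, 1),
    "build-node-2": date(2022, 12, 20),
}
print(reboot_overdue(inventory, today=date(2023, 1, 18)))
```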

21 Dec 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Infra ran its first "Roundtable" Slack-based meeting for users to
 hear from, and interact with, the Team. We intend to continue these
 going forward, monthly. Some good feedback on this initial meeting
 has been taken, to smooth them out going forward. Attendance was not
 what we hoped, and we hope that continuance and knowledge of the
 roundtable will see more people in the meetup.

Short Term Priorities
- idm.a.o to replace id.a.o
- New SVN server
- LDAP ACL rollout

Long Range Priorities
- Artifact Distribution Platform

General Activity
- Finalized the move of the infra blog to use Pelican. Next is docco
 and support for projects to use options for their blog via
 .asf.yaml. Goal is to provide PMCs with multiple blog options, for
 whatever works best for their community. Working OFBiz to shift
 their blog, for more experience and tool improvement. Some work to
 provide Hugo-based blogs has begun.
- Agenda Tool has made very good progress on parsing agenda files and
 presenting them through a webui. This has also been connected via
 websockets for dynamic updating as new commits arrive.
- The Gradle Enterprise installation continues to proceed. An initial
 test where we figured "should be easy" .. was not. Some builds use
 Apache RAT to seek out files without an ALv2 notice, and Gradle
 dropped such a file into their .mvn directory. A Pull Request has
 been created for RAT to ignore .mvn, and until if/when that is
 accepted, then the Gradle/Maven plugin will instruct RAT manually to
 ignore that directory. Further testing will proceed.
- Work on/off to upgrade Confluence. We had a temporary problem with
 the many plugins showing a cost, rather than complimentary, for Open
 Source Confluence installations.
- Met with Atlassian about moving to their Cloud-based products
 instead of our current on-premise installations. We now have a shared
 Slack channel and are making good progress, but it is going to take
 quite a while. The Foundation is the largest F/OSS on-premise
 install by a very large margin, which is creating some new porting
- A solution to limit spam Jira accounts was rolled out, but has met
 with a lot of pushback. The Infra team might just roll this back,
 and find other ways to deal with the spam accounts.
- New backup server is fully operational.
- Many more Windows build nodes have been spun up. Tooling has
 improved the setup and operation of these nodes.
- Starting a bit of work on rebuilding our selfserve platform using
 newer technology (async/Quart/asfpy).
- Continuing testing/review of "keycloak" as a future IDP.
- LDAP ACLs will be applied in early January, after much review and
 testing based on feedback from an initial push.
- Some experiments with Ansible for configuration management have
 begun. Initially, working on assembly of an "inventory" that Ansible
 can consume, along with a playbook for periodic reboots of
 servers. Our Jenkins build nodes are now using a script to perform
 these reboots (intended to pick up needed [kernel] security
 patches). We have started a goal for quarterly reboots of all boxes.
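
The Gradle/RAT item above hinges on a license-notice scan that needs
to skip a tool-generated directory. The sketch below shows the general
idea in Python: walk a tree, prune excluded directories, and report
files lacking a notice. It is not Apache RAT and does not use RAT's
matching rules; the single marker string and the function name are
simplifying assumptions.

```python
import os
import tempfile

# Simplified marker; the real tool recognizes many header variants.
LICENSE_MARKER = "Licensed to the Apache Software Foundation"


def files_missing_notice(root, excluded_dirs=(".mvn",)):
    """Return relative paths of files under root lacking the marker."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded dirs.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                head = handle.read(4096)
            if LICENSE_MARKER not in head:
                missing.append(os.path.relpath(path, root))
    return sorted(missing)


# Demo tree: one licensed file, plus an unlicensed file inside .mvn.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, ".mvn"))
with open(os.path.join(demo_root, "Main.java"), "w", encoding="utf-8") as f:
    f.write("// " + LICENSE_MARKER + " (ASF)\n")
with open(os.path.join(demo_root, ".mvn", "tool.xml"), "w", encoding="utf-8") as f:
    f.write("<config/>\n")

print(files_missing_notice(demo_root))                    # .mvn skipped
print(files_missing_notice(demo_root, excluded_dirs=()))  # .mvn scanned
```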

16 Nov 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Short Term Priorities
- Begin a series of "Roundtable" meetings for regular chats with
 people from the Apache ecosystem

Long Range Priorities
- Builds: Gradle Enterprise, artifacts.a.o, TravisCI to GitHub
 Actions, and ephemeral build nodes
- Security: MFA, OAuth, idm.a.o

General Activity
- Moved infra blog to a Pelican-based system
- Built a new BackupPC server, and separated rsync backups to its own
 server. More space, more reliability, about the same cost.
- Continued work on the Agenda Tool. We are going to plan a Sprint for
 the team and other interested members of the community. We'll
 announce something within the week.
- Work continues on the Artifacts Distribution Platform, and some
 testing from the community has begun (along with feature feedback).
- The migration of some TravisCI-based projects over to GitHub Actions
 continues at a regular pace.
- We put the LDAP schema "lockdown" on hold, when we discovered a
 bunch of unknown systems were using LDAP anonymously. We'll be
 setting a new date shortly. We upgraded the technology stack last
 month, and this lockdown is now possible, to increase privacy.
- We have established an approval process for creating Jira
 accounts, and have rolled that out to the projects. This is similar
 to what was done years ago for Confluence and commit permissions. It
 should stop the bleeding from Jira spam.
- Several team members are working with Gradle, Inc to set up a Gradle
 Enterprise server to provide projects with some additional options
 in their CI/CD processes. We are close to testing some builds.
- Started rolling out our new Slack bot, Qbot, to various channels for
 production use by Infra and some of our PMCs.
- Work has begun on migration of the svn server to a new machine with
 improved specifications and the latest Ubuntu LTS (22.04)

19 Oct 2022 [David Nalley]

Infrastructure is operating as expected, and has no current
issues requiring escalation to the President or the Board.

- Infrastructure had a Birds-of-a-Feather session at ApacheCon North
 America, which was well attended by committers and
 contributors, and resulted in valuable feedback and ideas for
 furthering the mission of the Infrastructure Team.

Short Term Priorities
- Making better use of donated compute offerings.
- Explore setting up a monthly "Infrastructure Roundtable" session
 where committers at Apache can learn and discuss our current and
 future service offerings.

Long Range Priorities
- Revamped artifact/download server strategy.
- Migrating our primary subversion server to address resource
- Address the need for a central roadmap of infrastructure service
 offerings, now and in the future.

General Activity
- Started implementing a new strategy for scheduling standard
 maintenance operations on all machines according to criticality of
- Work is progressing on implementing a new identity management system
 for committers.
- Discussions on simplifying and modernizing release signing processes
 and policies are underway.
- Addressed some shortcomings with our self-serve platforms when
 dealing with non-standard requests.
- Configuration management modules have undergone a great deal of work
 to keep our workflow up-to-date with the current optimal Puppet
- We are in talks with the VP, Data Privacy, about upgrading the
 Matomo analysis service to a fully infra-managed service.
- Work is progressing on assessing our backup strategy and process,
 with the aim of creating a new, more optimal process that is better
 suited for our high-volume needs.

21 Sep 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- blogs.a.o/foundation/ moved to news.a.o

Short Term Priorities
- Require sign-in to LDAP connection. Future: ACLs.
- Replace id.a.o with maintainable tech.

Long Range Priorities
- The Agenda Tool

General Activity
- webmod improvements.
- oauth.a.o moved onto a new, isolated machine. This box will pick up
 some new identity/account functionality.
- Continued work on our Pelican website builder system. Reducing the
 use of .py config files, in favor of .yaml. Second goal is to move
 sites to a generic/foundation-wide builder, and avoid the ability to
 specify per-project code requirements.
- Development of a new Slack bot to assist Infra primarily, with some
 additional features for the broader community.
- Moving some tools and scripts from our classic svn repository into a
 new git repository. This enables use of the GitHub tooling across
 the Infra community. Systems that use svn for "workflow" (such as
 account requests) will remain in svn, as it is empirically less
 brittle in these types of use cases.
- Refinement of spam/sender rejection. h/t to Infra volunteers.
- Improved mail monitoring, given new systems arrangement. This
 includes a new, custom monitor agent.
- Working with Apache HBase to spec out and launch CI/CD nodes under
 their targeted donation.
- Lots of work to get Puppet to set up ARM nodes, primarily for
 workers to support our Jenkins cluster(s).
- Continued work on the Agenda Tool, albeit slow progress. The basic
 structures needed for the tool are working. Feature work is the next
 set of work items, on that substrate.
- Testing new 2FA, identity, account management systems.

17 Aug 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Upgraded to new LDAP infrastructure, based on recent tech. We were
 being held back by the old FreeBSD-based mail server. That was
 replaced last month, removing the gate on the LDAP upgrade.

Long Range Priorities
- New Artifacts platform.

General Activity
- Upgraded Confluence (cwiki.a.o), to deal with issues related to
 password resets on non-LDAP-based accounts.
- LDAP upgrade; see Highlights.
- Various fallout from the LDAP upgrade, with a few references to the
 old servers found in some projects. Infra services have all been
 upgraded and are now stable. Docco has been updated to give us
 pointers next time this needs to happen.
- The new Agenda Tool has grown functionality to listen to pubsub and
 maintain a websocket to the client. Base level bits, mostly testing
 out the async subsystems.
- Continued work on artifacts.a.o, and some coordination with the
 Security Team to discover requirements.
- More Apache HBase build/test nodes.
- Pelican work (described last month) is progressing well.
- Completed work to add another buildbot worker, as part of our
 Pelican-based content workflow.
- Lots of Artifactory work, to fix/improve our base Ubuntu 22.04
 system and better deal with our signing certificates.
- Initial support for ARM systems, and a Jenkins node.
- Our Vault proof of concept is moving along, for projects to be able
 to store secrets (primarily: for GitHub Public Runners).
- Initial work on a Slack bot to assist Infra with its work, in terms
 of monitoring/alerts/escalation, dashboards, functions, etc.
- Exploring a new workflow for Jira account creations, to reduce
 ticket spam.
- Considering a new mechanism for tracking our large set of JDKs,
 Maven and Ant versions, etc. on our build nodes.

20 Jul 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Our old mail processor ("hermes") had a severe hardware crash, and
 the cloud provider was not going to reestablish it within timeframes
 we wanted. We had a new processor ("mailgw") already running, and
 had moved about half of our mailing lists to that new system. Rather
 than try and fix the old, we moved onwards to the new using a recent
 backup of the list configs. Mail was held up for about 10 hours;
 most went through, but some bounces occurred. A few subscription
 changes may have been lost. No mail that was accepted by our systems
 was lost; all was eventually delivered and archived as appropriate.

Short Term Priorities
- Move to new LDAP servers.
- Finish Pelican turnkey work.

Long Range Priorities
- Create artifacts.a.o

General Activity
- Working on new backup system to increase throughput and storage.
- Our OAuth system has an optional 2FA element that needs more
 thinking. We've started some planning around that, enlisted another
 volunteer, and will produce a plan for comment.
- Our Pelican workflow is not turnkey, so a lot of work is in-process
 on cleaning that up, improving the documentation, and then walking a
 volunteer PMC (already identified) through the process to double-
 check our work.
- We had a single BuildBot worker assigned to performing Pelican
 builds. It went down hard in the datacenter, so we have brought that
 back up elsewhere and will construct a *second* worker for Pelican
 building redundancy.
- Heard back from HBase, and are getting them more Jenkins build
 nodes, per request.
- Working on the ASFbot, to better handle future Members Meetings.
- Work also proceeds on the Agenda Tool, for Board Meetings.
- Design work has been picked up for a new artifacts.a.o that will
 pull together many of our servers that simply provide (large)
 artifact storage and delivery (eg. releases and test results). This
 will also simplify the release workflow and provide some gating to
 ensure our release policies are followed (eg. things are signed and
 available in KEYS).
- Continued preparation for move to modern LDAP servers, now that we
 no longer have to support our old FreeBSD-based mail server.
- Beginning a rollout of a simple tool for web-based moderation of
 ezmlm messages.
- New website certificates installed, with a couple more this month.
- We have been running into packaging issues with puppetized installs
 for 22.04 servers, and are working through the problem with our
 service provider (who provides the apt repository).

15 Jun 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- We have completed migrating our last, old host out of Oregon State's
 Open Source Labs. Infra will coordinate with M&P to write a blog
 post thanking them, with a history of their assistance to the ASF.

Short Term Priorities
- Finish migration to our new mail gateway.
- Migrate to new LDAP server.

Long Range Priorities
- Revamped artifact/download server strategy.

General Activity
- New mail server is now fully-tested to our satisfaction, and is in
 use. We have been moving groups of mailing lists associated with a
 hostname (such as ""), and will be moving additional
 groups periodically until complete. This process should be finished
 by the July Board meeting.
- Infra helped the Chair with the Infra-side of to
 get the June meeting up and running.
- Moved to a new TLP server, in order to increase disk space. This
 "server" is the origin server behind our Fastly CDN.
- Added more disk to our Apache Subversion server, which is primarily
 growing through additional demands on artifact storage and
- As part of our management of large-storage machines, we're reviewing
 how to combine several machines (being used for slightly different
 purposes) into one or two boxes going forward.
- Work proceeds on a script to help TravisCI users to automatically
 create a (draft/initial) GitHub Actions script.
- Assisted Apache OpenOffice with upgrading translate.a.o for their
 Pootle installation to perform language translations.
- Turned off our Puppet v3 Master. There are one or two stragglers
 left, but they no longer require puppet management. This machine
 ( was our LAST physical box located at Oregon
 State Labs. OSU/OSL has helped the Foundation tremendously for well
 over 15 years, and Infra has many thanks for them.
- Expanded the set of Jenkins build nodes for Apache HBase, via the
 project-specific donation.
- Continued work on the Agenda Tool. There is no defined schedule for
 this work, but we have refocused our team member to spend more time
 moving the tool forwards.
- Planning has begun within the team for a presence at ACNA'22.
- Renewed a couple dozen domains, and the *.staged.a.o certificate.
- Confluence was down for a day, due to a major zero day security
 problem. We monitored it, and applied the fix as soon as Atlassian
 released it.
- New Grafana account to perform plugin signing for projects.

18 May 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Short Term Priorities
- Complete mail migration, then LDAP migration.
- Support Ubuntu 22.04 on our boxes.

Long Range Priorities
- Complete the Board Agenda Tool
- New artifact distribution system

General Activity
- Initial work on per-path commit emails for git
- AOO has completed VM migrations. Overall, we are nearly complete
 with the migration away from Puppet v3.
- The new mail gateway is coming up to speed, and is handling some
 projects' mailing lists now. We're fixing problems as we perform
 some moves from the old machine (hermes) to the new system.
- More machines for HBase's Jenkins cluster (via donation).
- Investigations on blogging alternatives for M&P and projects.
- Pelican website improvements in docco and code.
- PoC deployment of Hashicorp Vault for secrets management, primarily
 for GitHub self-Runners (but possibly broader usage).
- Bolt (a puppet tool) work, in relation to some of our machine
 inventory scripting.
- Jira upgrade to fix a security issue.
- Cleanup/fixes for the upgraded GitBox service.
- Designing a new platform for artifact distribution, rolling up
 dist.a.o, nightlies, archive, etc.
- Lots of disk space issues across our Jenkins clusters.
- Resolve backup issues for one of our larger boxes.
- Prepare Apache STeVe for upcoming Members Meeting.
- New Guy gets thrown into pager rotation. Heh.

20 Apr 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- GitBox was migrated to new hardware/features.
- Two of our older/expensive boxes were successfully decommissioned.
- M32 (March 32nd, aka April 1st) was great fun. Thanks to all that participated!

Short Term Priorities
- Finish LDAP migration to updated/modern platform.
- Finish mail system migration from hermes to mailgw.

General Activity
- M32, our April Fools' service, was by and large well received.
- Work continues on the agenda tool.
- Addressed some CVE alerts for upstream software.
- GitBox was migrated to new hardware, and the new unified account
 and repository management system (Boxer) was deployed.
- Discussing a unified distribution platform to replace the
 multitude of services we have for distributing our software.
- Working on contingency plans for our CI providers
- More self-serve features were enabled by volunteers through our
 .asf.yaml service.
- Further investigated and discussed public runners for GitHub Actions.
- Work on automating taking inventory of our machines for use with Bolt.

16 Mar 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Finally removed Buildbot 0.8, in favor of Buildbot 3.2. We have been
 running both systems for a while now, while working with projects to
 migrate their testing configurations.

Short Term Priorities
- Finish LDAP migration to updated/modern platform.
- Finish mail system migration from hermes to mailgw.
- Gitbox v2

General Activity
- Lots of testing on the new mail gateway, using a few mailing list
 migrations to the new mail system.
- Initial development of the Agenda Tool has started in earnest, and
 is progressing. Feedback and requirements from the Directors have
 been quite helpful.
- Completing some migrations, and turning off our proxmox boxes.
- Began an experiment with a webapp to manage email moderation.
- Word policing: dropping BB 0.8 removed thousands of uses. We have
 been in-process on migrating from Puppet v3 to v6 for a long while,
 which will remove hundreds of uses upon completion
- Working through some major hassles with a couple of the Jenkins
- Budget work for FY23.
- Assisting with operation of the Apache STeVe voter tool for the
 Annual Members Meeting.
- Buildbot 0.8 has been fully retired.
- Working through an upgrade for blogs.a.o. First attempt ran into
 some issues, so we rolled back the upgrade.
- Implemented a new Datadog-based continual check for health of the
 Puppet system. It detects more failure modes than the upstream
 integration. We're considering if/how to upstream our version.
- Dealing with build nodes for specialized architectures.
- Nicely resolved some longstanding backup issues.

16 Feb 2022 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Turned off two of our oldest machines: minotaur, and baldr
 (CMS). These machines chugged away for us, for over a decade.
 They will be missed.
 Not really.

- A budget has been prepared for FY23, and presented to the President
 and EVP for presentation to the (new) Board in March. Small
 refinements are expected when January "actuals" are received.

Short Term Priorities
- Spin up "mailgw" ... the mail gateway, to replace the ancient
 "hermes" machine that routes all the Foundation's email.
- Turn off Puppet v3, and decomm our last physical box.

Long Range Priorities
- Gitbox v2 is finally "on deck"
- Quite long range: the new agenda tool

General Activity
- Exploring Atlassian Crowd as our password change tooling.
- Work has begun on a new Board Agenda tool.
- Lots of buildbot work: migration to v3.2 and Windows nodes.
- Initial deployment of Jenkins nodes for Apache HBase, using the
 targeted donation they received.
- For spam management reasons, we are testing Apache "inboxes" rather
 than forwarding.
- Word policing: p3 repository: 300+ verboten words removed over 5
 years; p6 repository: 225 removed over 2 years; website: from 31 to
 13 over the past two years.
- Cleaned out Rackspace, as they no longer offer a complimentary
 allocation for F/OSS organizations.
- Much work on the backup server.
- Atlassian CLI upgrades to improve selfserve.
- Finalizing testing plan for moving to mailgw, and decomm of hermes.
- mbox-vm has been upgraded/migrated.
- Created prototype webapp to help manage ezmlm moderation requests,
 and the set of list moderators. This should obsolete a
 class of Infra Jira tickets.
- More project VM migrations to deal with old Ubuntu and Puppet.

19 Jan 2022 [David Nalley]

Infrastructure is operating as expected, but has one issue to escalate
to the Board:

- Two projects have been unresponsive for many months to our inquiries
 and missives to migrate them away from the Apache Content Management
 System (CMS). We have already disabled the CMS for one, and are
 hoping to hear back from the other before we disable them, too. The
 Infra team is not impacted when we disable a site -- it simply means
 their website becomes frozen, and no further edits are possible. The
 project will be impacted on the day they try to update their
 website. The Infra concern is whether these projects have enough
 energy as an ongoing project (not really our concern), or are simply
 ignoring queries from Infra (which is a problem for us).

- Our new hire began work on January 3rd.
- mail-archives.a.o and mail-private/mail-search.a.o were retired in
 favor of lists.a.o. Redirections were put in place to handle any
 outside long/short links to the old services.

- FY23 budget planning is about to begin. Infra has a pretty steady
 (if large) set of expenses, so should be able to produce a draft
 budget in short order.
- The altered payroll bumped our monthly expenditures. We anticipated
 the bump on one system, but missed another. That was easily
 corrected, and everything looks to be flowing properly. We should
 see stability after a few payrolls, which can then feed into the
 budget planning.

Short Term Priorities
- Complete the minotaur.a.o decommission (Jan 31, 2022).
- Complete the cms.a.o decommission (Jan 31, 2022).

Long Range Priorities
- Complete email system migration to our new Ubuntu-based
 system. Short-term work includes moving some @infra mailing lists to
 the new system ("eat our own dogfood").

General Activity
- Updating our onboarding documentation.
- Lots of log4j assistance, along with remediation on a few of our
 Java-based services. Some project VMs were turned off, pending the
 projects' availability to upgrade/correct those systems.
- Preparation for upgrading our LDAP servers/system. A compatibility
 problem was found, so the upgrade was deferred to February.
- Crowd is being upgraded and will provide a replacement for the
 service for changing passwords and basic name fields in the LDAP
 records.
- Standing up Jenkins nodes as part of a new, donated testing cluster
 for Apache HBase.
- Backfilling many archives at lists.a.o, primarily from private lists
 that had not been initially provided.
- Lots of migration of projects using CMS and Buildbot 0.8 to our
 newer systems, so we can turn those off.
- Much work on the redirection service was completed, particularly
 with the contributions of an Infrastructure volunteer.
- Continued progress on Gitbox v2.
- Migration from BB0.8 to BB3.2, including Windows nodes.
- Some migration of systems from old Ubuntu and Puppet v3, so that we
 can decommission our v3 server.

15 Dec 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- New announce@infra.a.o with auto-subscription (no opt-out) for all
 PMC members, and opt-in for all Committers. This will be used to
 announce changes in infrastructure that could be important to the
 communities' use/operation of our services.
- All TLP websites now redirect to TLS (https). Some content fetched
 by software that does not follow redirects (eg. DTDs and schemas)
 is carved out from the redirect to https.

November expenses are well above normal due to accumulated staffing
expenses that were withheld during FY21, per instruction from the
Treasurer and President. These values were included within the FY22
budget, so the one-month spike should not throw off FY22 as a whole.

Short Term Priorities
- LDAP changeover to new system (scheduled: Dec 18)
- Complete onboarding of new-hire

Long Range Priorities
- GitBox v2 is starting some initial testing, but needs to be deployed
 to migrate off 16.04

General Activity
- Working with Operations' attorney to create a Confidentiality
 Agreement for US employees (historically, this was provided by
 Virtual, Inc.)
- End of year 401k review; a Resolution is before the Board.
- Standing up a Jenkins cluster of donated s390 nodes.
- Lots of Buildbot work (to 3.2, Windows, etc)
- Some impact from the AWS outage, but it eventually cleared.
- Gitbox v2 beginning to make progress.
- Gradle cache deployed.
- Beginning a new Jenkins cluster for HBase, to incorporate some
 donated build nodes.
- Some fine-tuning of NodePing, PD, and StatusPage.

17 Nov 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Offer extended and accepted for our open position. Paperwork has not
 been signed, so a name and introduction will come later.
- Our mail archive service, PonEE, was updated to a new, improved
 version of its software.
- Mirror system turned off.

Short Term Priorities
- Decommission: mail-*.a.o, minotaur.a.o, Puppet v3 (devops.a.o)
- Upgrade to new LDAP servers
- Turn off CMS, finally

Long Range Priorities
- Gitbox v2
- Migrate mail system off hermes

General Activity
- Jira, Confluence, and CCOS upgraded to new security releases.
- Rough download stats, experimental.
- Standing up a new infra-internal "dashboard" to better direct the
 team to documentation, runbooks, and data query systems.
- Buildbot configuration migrations from v0.8 to v3.2
- Planning a move to https: redirects and HSTS header.

20 Oct 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Mirror system deprecated. CDN full-time.

Short Term Priorities
- Complete interviewing of candidates.

Long Range Priorities
- Wrap up P3 migration; Gitbox V2; mail; LDAP

General Activity
- Changes to closer.lua to deal with European distribution from
 dist.a.o and turning up Fastly incrementally to test eventual
 load. Testing was successful, and we are now at 100%.
- Large cleanups around our backup processes. Some machines have been
 moved to rsync, rather than backuppc. Evaluating some packages
 (again) for tooling improvements.
- Confluence was upgraded.
- Migration of Buildbot jobs from version 0.8 to 3.x
- Working on the final CMS stragglers.
- We have access to a lists.a.o upgraded service, and have been
 testing that. No showstoppers have been found, so we will ask for
 the upgrade to take place in early November.
- Wrapping up some Puppet v3 migrations to new Puppet.
- LDAP work. We're on a very old LDAP deployment, and will be shifting
 to a modern LDAP with more capabilities around logging and access
 control. Still working on setting up the new system, and planning on
 a migration process.
- Testing a mailbox service for addresses, rather than
 performing forwarding. This should fix some problems where spam
 email is forwarded by us, leading to's outbound SMTP
 server is flagged as a spam-sender. This could be a big change for
 our community, as we'd be providing an inbox rather than a forward.
- Testing permalink redirects from mail-*.a.o to lists.a.o, so that we
 can shut down the mail-* sites, and leave a redirector.
- Started some work on an Infra internal-ops home page to pull
 together all of our runbook and docco. A dashboard was originally
 envisioned, but the true need is pulling info together.

15 Sep 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Our use of a CDN appears to be working smoothly on picking up our
 entire traffic load (websites and downloads). This will obviate our
 historic (over 25 years) mirror system. Infra is working with M&P to
 communicate this move, and to write an article/thanks about the
 system's history. Our mirror system was transformative and supplied
 our ecosystem for decades. We do not intend to end it with a simple
 "thanks" to our many providers, so M&P is helping to create a full
 rollout and handling.

Short Term Priorities
- Fill our open position.

Long Range Priorities
- Gitbox; mail gateway

General Activity
- Lots of work on backup management: improving use of bpc, shifting
 some large systems to rsync, deployment of new backup server and
 decomm of the old server, and evaluation of new/alt systems.
- New version of auto-blocking system deployed, along with companion
 logging and new logs cluster.
- Lots of Buildbot 0.8 migration to 3.2.
- home.a.o migrated to better/cheaper system.
- The CDN satisfies all website content, and we have incrementally
 shifted artifact downloads from our servers/mirrors over to the CDN.
- Critical upgrades for cwiki security.
- Planning how to test/roll-out a large service upgrade on lists.a.o

18 Aug 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Major work is being completed on migrating from Buildbot 0.8 over to
 Buildbot 3.2. We are engaging communities to help them with the
 migration. Beyond the technology/feature upgrade for our projects,
 this also provides Infra with streamlining its deployment of the
 buildbot master and the network of worker nodes.
- The TLP websites are now completely serviced by the Fastly CDN
 network. We are sending a fraction of projects' downloads to the
 Fastly network, with an eye to measuring traffic and increasing the
 amount of downloads serviced by Fastly.

Short Term Priorities
- Complete the hiring process.

Long Range Priorities
- Decommission mail-*.a.o in favor of lists.a.o

General Activity
- Wrapping up CMS migrations, and improving docco on how to manage
 websites with the ASF infrastructure/workflow.
- New backup server has been deployed, and systems are in-process on
 migration. We will be able to decommission a couple backup servers
 once this process is concluded (shortly). We have more storage at
 reduced pricing, so win-win.
- The migration to Artifactory (from Bintray) is being wrapped up;
 some straggler packages have been identified and fixed.
- Blocky v4 has been mostly completed, and is rolling out. This system
 provides traffic rules for our sites, and automatically bans clients
 from all of our systems when a violation is identified.
- New GitHub action created/supported for pushing content to
- Lots of work on our mail systems, and in particular: dealing with
 spam and how next-hop systems mistakenly believe we are the source
 of that spam. We're testing several approaches to remedy this, as it
 affects final-delivery of the email we send.
- Working towards a larger automated system for tracking all of our
 hosts-of-interest inventory. We have a solid track of *our*
 machines, but there are dozens donated to the Foundation which we've
 found that we should track better.
- home.a.o has been migrated to a lower-cost provider.
- We enabled a strict setting of SPF recently, and have not received
 any pushback/reports on this (unlike our first attempt a few years
 back). This will help prevent other sites from pretending to deliver
 email from @a.o accounts, which may help settle some spam issues.

21 Jul 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- www.a.o is running on the new Pelican-based publishing workflow.
 Infra is following up with a few remaining PMCs to transition their
 websites off the CMS.
- Infra job posting is now live on
- Have begun moving TLP websites to the Fastly CDN. All sites will be
 moved to the CDN, in the next few weeks.

Short Term Priorities
- Decommission minotaur and the CMS.

Long Range Priorities
- Turn off mail-archive/private, in favor of lists.a.o

General Activity
- LDAP upgrade/migration work continues.
- Continued Puppet v3 to v6 migrations. Nearing the end.
- Finalizing and smoothing out the rough edges of our move from Bintray to
 Artifactory at JFrog.
- Upgraded BuildBot system to BB3. In-process on a plan to migrate
 projects from BB0.8 to the new system.
- Turned off the meet.a.o experiment. The technology simply wasn't
 mature enough to create an easy-to-maintain audio/video meeting
 platform for the Foundation's projects.
- Workflow refinements to make it easier for projects to use Pelican
 for their websites.
- Our new logging cluster servers were provisioned, and are being set
 up to collect logs from our many systems.
- Work done on a database to construct mappings for old lists.a.o
 permalinks that used a prior Ponymail algorithm. Work continues to
 get this tested, to ensure we have all needed data.
- New Jenkins controller for the Maven project, to include their own
 nodes and reduce impact on the Foundation-wide build nodes.
- Deployed a new backup server. More storage, lower cost.
- gitwcsub turned off, in favor of .asf.yaml only.

16 Jun 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- CMS migration work has basically completed (including www.a.o) and
 we are in final testing/evaluation by the project communities.
- Writing up a job description to post this month, to hire another
 Infra staffer (filling our last open headcount).

Short Term Priorities
- Complete LDAP server upgrades
- Get somebody hired

Long Range Priorities
- Turn off the CMS
- Validate our v1-format permalink database for lists.a.o, and
 deprecate all the old architecture.

General Activity
- Decided on constructing an LDAP pubsub client to watch for LDAP
 changes from our old system, and push those changes into the
 new. This simplifies the compatibility issues and will also give us
 LDAP logging/audit.
- Pulled the switch: is now serving DNS via Route53. The
 load is much lower than we expected (important if we ever have to
 pay this bill). R53 provides much more insight/tooling that we did
 not have before, which is a nice bonus.
- The .asf.yaml system grew a new "autostage" feature to simplify
 staging, particularly when used by the pelican builds.
- The old gitwcsub.conf has been deprecated in favor of .asf.yaml
- Various tweaks dealing with the move to Artifactory.
- Performing lots of work/assist for the Foundation and projects to
 move from Freenode over to
- Setting up a template-site and infrastructure-pelican to support
 projects that want to use the default/recommended workflow for
 publishing websites. Lots of related documentation updates.
- Improved website staging features, and applied various fixes.
- Continued work on moving to Atlassian products in the cloud, via
 testing some custom conversion scripts they built for us.
- Various work on our Jitsi instance. We hope to make it stable and
 performant enough to offer the service to our communities.
- Some Jenkins nodes were donated for the Apache HBase project.

19 May 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- www.a.o has been fully-ported from the CMS over to our new
 Pelican/GFM service, and is undergoing acceptance testing. There
 are about eight more CMS-based websites to finish moving per the
 work contract.

Short Term Priorities
- Critical blocker was figured out in the LDAP migration mentioned
 last month, so this will be continuing.
- Switch to R53 so that we have an API to perform DNS-based challenges
 from LetsEncrypt for wildcard certificates such as *.staged.a.o

Long Range Priorities
- Gitbox v2 is progressing, with steps identified for migration of
 individual components instead of all-or-nothing. We will perform
 testing with Infrastructure repositories first.

General Activity
- Planning for Infra's "presence" at ApacheCon 2021. Looks like one or
 two sessions will be proposed.
- Infra is working with a potential future service provider,
 evaluating their service for our needs. We are also providing
 feedback to their teams for future feature work.
- Atlassian meetings continue for moving to their cloud offering.
- Artifactory has been spun up, and bintray turned off. Thank you
 JFrog for the service (past and future).
- All Infra services billed via Citizen's Bank credit cards have now
 been moved, and we're monitoring this new workflow. So far: it is
 very awesome.
- InfraAdmin/Greg finally brain-dumped to Andrew a long discussion of
 the job duties, should another need to take over.
- Continued migration from Puppet v3, up to v6.
- New system for keeping svn authz ACLs synchronized is now completed and
 running in production. ("svnauthz")
- Improved our Pelican/GFM build system to better-support our
 contractor for the CMS migrations. This also includes providing some
 "assists" with coding to more closely fit Infra systems.
- Jenkins work around testing ElasticSearch metrics (decided against)
 and some issues around the GitHub PR Builder.
- Brand new system for managing and displaying fingerprints for all
 machines managed by Infra. Lots of stale machines were cleaned out,
 along with somewhat-related DNS entries.
- Some assistance to Sam for the Agenda Tool on agenda.a.o.

21 Apr 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Contractor hired to assist with CMS migrations for the primary
 www.a.o website, and for projects that do not have time, energy, nor
 experience to perform those migrations.
- Excellent financial shape.

Infrastructure has landed well below budget for four primary reasons:
we did not fill our open headcount, we did not conduct a team F2F
meeting, the "credit card" discretionary fund was overbudgeted due to
historical comingling of hosting charges to the card which are now
separate line items, and reduction of hosting costs by shifting
services to lower-cost providers.

For FY22, we plan to fill that headcount, and the budget has been
finer-tuned to the actuals from FY21.

Short Term Priorities
- Provide support to the CMS migrations contractor, and finally
 decommission the CMS service and its aging hardware.
- LDAP migration.

Long Range Priorities
- Turn off the old mail-* services in favor of lists.a.o

General Activity
- Our puppet server has been migrated to a new set of machines in AWS,
 increasing the throughput and responsiveness of Puppet.
- Much work on www.a.o content.
- Gitbox v2 is progressing nicely, with a recent focus on account
 linking between the ASF and Github.
- Many meetings/work with Atlassian to plan our move from self-hosted
 products to their cloud offering.
- JFrog has deprecated their bintray service, and has kindly offered
 to host us on their Artifactory service instead. It should be about
 the same, functionally.
- New daemon for processing the svn authz files.
- Various ElasticSearch work: copy of data from lists.a.o, setting up
 a new cluster with newer code and better hardware for cheaper, and
 looking at Amazon's new OpenSearch release.
- Integrating a huge number of new Jenkins nodes for Cassandra.
- Shifting Infra billing systems to our new TDBank/Ramp service.
- Considering a turndown of our/donated DNS servers, in favor of R53.
- Migrating everything from gitwcsub to .asf.yaml only.

17 Mar 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

There is nothing to highlight beyond the budget that Infra has
submitted to the President/Operations.

Infra has submitted a draft budget to the President for presentation
at the March meeting of the Board.

Short Term Priorities
- Complete transition to new Puppet setup

Long Range Priorities
- Finish our hermes mail system migration

General Activity
- Constructed new Puppet system in EC2, across several machines, to
 improve responsiveness and speed
- Moving from Bintray to an Artifactory instance, sponsored by JFrog
- New service for simple exec when a commit happens
- Refinement of red/orange/yellow status issues, with related paging
- Work on meet.a.o for our communities to conduct meetups.
- Infra contributed support for the Annual Members Meeting
- Doc updates across infra.a.o
- Lots of "aardvark" work. This is a honeypot to capture spammers and
 subsequently lock them out of our systems.
- Initial work on selfserve for mailing list moderator changes

17 Feb 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- We were advised of a "sudo" vulnerability by the Security Team. Most
 of our machines do not have general access, and the users who can
 access them tend to have sudo. But with that said, we applied the
 corrective patches within hours across our entire set of hosts.

- Budget process is beginning.

Short Term Priorities
- Roll out a replacement for JFrog's Bintray. This will likely be a
 cloud-hosted Artifactory system (from JFrog), though we also have a
 Nexus v3 license from Sonatype. Evaluation continues.

Long Range Priorities
- As usual: email and gitbox v2.

General Activity
- LDAP schema and definition/interaction changes, to deal with various
 types of naming across cultures.
- Investigating sign-in via Okta for future cloud-based services at
 Atlassian. Possible use of Okta within our own services.
- Performance issues related to backup, proxmox, puppet, and others.
 We continue to tune to keep the services smooth, and to align our
 needs with the appropriate underlying box.
- Improvements on what to back up, and general tooling around keeping
 disk usage within desired constraints on build machines, and our
 mail systems.
- Stood up a new TLP server, and dealt with a number of
 synchronization and update problems on the new server and old. Some
 tooling received updates for more bullet-proof operation. All TLP
 servers are now on Puppet v6.
- A large amount of doc updates and clarification around GitHub,
 Gitbox, and our use of git at the Foundation.

20 Jan 2021 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Staff is now completely on our new PEO, three payrolls have been run
 successfully, and our benefits plan started on January 1. The 401k
 plan should start on March 1.
- A potential security gap in GitHub Actions led to our disabling of
 "untrusted" third-party Actions until we could investigate. We are
 on-track with security recommendations and a scanner to enforce that
 policy, as people adjust workflows [which use Actions].

- Budget planning will start very soon. Staffing represents over 85%
 of the Infra budget, and that was completed during our PEO
 transition. The remaining portion of our budget will be very
- Overall, the Infra budget is doing well for FY21, as we did not
 backfill our open position (and had budgeted for a half-year).

Short Term Priorities
- Finish the FY22 budget.
- Gitbox v2 launch, for a few private repositories.

Long Range Priorities
- Finish the mail system upgrade. It is in testing right now.
- Extract permalinks from our PonEE snapshot, and validate them to
 ensure we have a disaster recovery plan that includes permalinks.

General Activity
- Helpful edits for community.a.o and incubator.a.o
- Continued migration from old ubuntu/puppet.
- Gavin led another "Builds Meeting", with two people from GitHub in
 attendance to answer questions, and to take concerns, around the use
 of GitHub Actions (per the highlight above).
- Beginning to explore Okta as an identity platform. It will be used
 for when we migrate to the Atlassian cloud products. We may be able
 to use Okta in other areas of the Foundation.
- Ported the Buildstream repository (and other supporting data, and
 metadata) from Gitlab over to GitHub.
- Some volunteers are assisting with Jira administration, and JDK
 support via bintray.
- Beginning testing of GitHub's Container Registry, as their old
 package repository is deprecated and causing some issues.

16 Dec 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- We have signed all the paperwork with our new PEO, and will have
 completed onboarding our US employees by the Board meeting.
- Infra has started a monthly "All Things Builds" conference call,
 which has shown good interest. It allows for cross-connecting
 between projects.

Short Term Priorities
- Finish testing of the new mail systems.

Long Range Priorities
- Gitbox v2: on recent Ubuntu and Puppet, more reliability, improved
 codebase, etc.

General Activity
- Mail server rebuild is progressing well, and we are performing
 various tests before moving more of the Foundation onto the new
 system. Individual mail routing is working fine, and testing some
 mailing lists is on deck.
- Continued some fixes on the account creation script.
- Initial work on moving from (ancient) Buildbot 0.8 to Buildbot 2.0
- Moving projects from (deprecated and shutting down) to
 our paid-for service on
- Preparing a meet.a.o server for Apache audio/video meetups.
- Working/meeting with Atlassian on product issues.
- Wrote an analyzer for people who mark mailing list messages as spam,
 so we can track them down and unsubscribe them.

18 Nov 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- PEO is progressing well, and should have an update around or soon
 after the Board meeting.

- Given that we have not rehired for the open position, we are well
 within budget for FY21. Finances will be a little unpredictable as
 we settle in with our new PEO firm.

Short Term Priorities
- Resolve PEO issues for the US-based staff.

Long Range Priorities
- Finish hermes (mail) migration. Almost there!

General Activity
- Testing 2FA within Crowd, specifically to require for root. Goal is
 to use Crowd for LDAP password changes (and retire id.a.o).
- Continued CMS and VM migrations.
- Uploaded historical ID mappings for lists.a.o for exploration.
- Progress on generating metadata for TLP, retirees, and fingerprints.
- Renewed work on configuring a CDN.
- Beginning moves from over to
- Mirror work, due to Chrome deprecation of plain http: and ftp:
- Account creation revamp, along with supporting changes to asfpy.
- Gavin's "Inside Infra" is now live.
- Jenkins work on capacity, website generation, plugins, bintray,
 better job handling, etc.
- DockerHub fixes to deal with their upcoming rate limiting.
- rsync.a.o filled up, so we quickly rebuilt it.

21 Oct 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- "Attended" ApacheCon@Home so the staff could meet with the
 community, and they could meet us.

Short Term Priorities
- Continue some improvements on our DNS systems for initial project
 setup, to better-manage retired projects and podlings, and AWS R53.
- Stand up GitBox v2.

Long Range Priorities
- Get all projects off of the deprecated CMS service.

General Activity
- Jenkins' server had some bad RAM, so we had to move it
 unexpectedly. Thanks to our work with Puppet, this happened
 easily and quickly.
- Some projects have started with their migrations off of the CMS, and
 providing feedback on our migration documentation.
- Project and Infra VMs are continuing their migrations to Puppet v6
 and Ubuntu 20.04. The spam servers we had in PNAP have moved over to
 Hetzner, which freed up lots of space for projects VMs within PNAP.
- We stood up a new TLP server on v6/20.04. The other two TLP servers
 are on p3 and 16.04.
- Finally decommissioned the Fisheye service (unused).
- Lots of Jenkins work with upgrades, the move, adding more donated
 nodes to our various masters.
- GitBox is on an old p3/16.04 VM that needs to be migrated. We are
 doing this a piece at a time, to ensure that we don't break this
 critical service. Some redesign is occurring, so we call this v2.
- Documentation work continues, including a new "Infra 101". Andrew has
 been working with M&PR and Central Services to assist with the ASF's
 main website.
- Initial testing has started on a shared 2FA solution for role accounts.

16 Sep 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- CMS out-migration has begun in earnest. This service has been
 deprecated for many years, we have several alternate solutions for
 the projects, and our documentation for website building has been
 improved, to support the migration.
- All Jenkins nodes are now managed via multiple CloudBees Operations
 Center instances, providing better service to the projects with
 "their own" Jenkins nodes. The shared cluster is now managed this
 way, too, and all jobs moved over from the old shared cluster.

Short Term Priorities
- Finish GitBox v2, in order to use our newer platforms/services.

Long Range Priorities
- Migrate all projects away from the CMS, and turn it off.

General Activity
- Many improvements around "pipservice" tooling, and our deployment of
 those tools to all our machines. Some tricky work was needed to
 uninstall pre-pipservice tools.
- Turned up our usage of the AWS Route53 DNS service for the few
 domains which are not served directly by Namecheap (most are
 redirects to $ The expected increase in traffic did
 not materialize, however, so we may have already been serving the
 bulk of our DNS queries via R53.
- Our mirror network is now served via IPv6, where available. Most
 mirrors are served via TLS (a future requirement of Chromium when
 downloading content).
- All projects have migrated away from Ubuntu 14.04, but many are
 "in-flight" on migrating away from 16.04. We are migrating them to
 Ubuntu 20.04 (LTS) and Puppet v6, mostly within the PNAP
 datacenter. Infra has some older machines which are lower priority
 to migrate.
- Our spam-processing infrastructure has been upgraded to 20.04/v6, as
 part of our continuing process to modernize our mail subsystems.
- Lots of supporting work for the shift to Operations Center, such as
 account credentials, and using GitHub Apps to trigger builds. This
 work creates a more solid system and avoids some GitHub API limits.
- Additional support for the new nightlies.a.o service.

19 Aug 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Short Term Priorities
- Contact all communities still using the deprecated CMS service, and
 get them onto a plan for migration.
- Turn on Route53 for the bulk of our DNS provision.
- Move project VMs into PNAP.

Long Range Priorities
- Turn off the CMS.

General Activity
- Our email "outbound" IP was assigned to a new machine, as we
 continue to reconfigure our email systems/network. This removes more
 functions from our old/outdated box, which simplifies the ongoing
 work to upgrade/migrate to our standard Ubuntu layout.
- While performing the email systems upgrades, we have also up'd our
 game with SPF records (spam prevention) with the huge assistance of
 one of our community members.
- Lots of Jenkins work to migrate communities to our new clusters.
- Launched a new nightlies.a.o service for (large) transient files.
- Gitbox "v2" is in progress. Some services are moving from the v1
 system over to the v2, to minimize impact when the "big switch" is
 pulled. We're already using AWS' SQS to ensure that we don't lose
 data during any such migration.
- Our pubsub system has been upgraded to enable subscriptions to
 changes within private/protected repositories.
- Many migrations from older systems to Ubuntu 20.04 and Puppet v6.
- Solved a security issue within our download page scripting by
 tightening it up. At the same time, we were able to finish the
 deprecation of download assets from our primary TLP servers,
 shifting them to our download server.
- Continued work on website publishing workflows. Notably, adding
 Jekyll as an option for communities. Hugo is also being explored,
 although we have not moved that into "official support".
- Our mirror system has been made fully IPv6 ready, including
 proper geolocation features and supplying IPv6 visitors with
 a list of IPv6-enabled mirrors.

15 Jul 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Much of the work this past month has been working on improving our
 basic systems (see the many details below), to provide services to
 our projects. There is no key feature to point out beyond "better".

Worked with the Treasurer's office around our deployment, and
testing their API system. The goal is to use the API to extract
month-end data for long-term archival and data ownership. This
workflow is not strictly about Infra, but occurred as part of the
Operations in assisting.

Infra has been able to correct all of its payment flows with
and has completed a final transition of its Accounts Payable to that
platform. The workflow has been improved for our service providers,
for the approval process, and for the final payment process. Small
refinements will be tried, related to currency conversion and wire
transfer improvements.

Short Term Priorities
- Consolidating Jira/Confluence accounts into Crowd, and updating the
 account selection process to avoid committers having preferential
 rights on "availid" names relative to the non-LDAP accounts.
- Deploy a new Jenkins Master. The current server is experiencing
 occasional warnings on its disk, and it is over-provisioned now that
 we've moved many systems off the shared-server to their *own*
 clusters.  Our new provisioned system will be at least as capable,
 and at half the cost.

Long Range Priorities
- The team is getting closer on moving our email framework over to
 Ubuntu 20.04 and Puppet v6. This should be completed this year.
 Given the critical nature of email, we've been taking this *very*
 gradually and testing pieces. We're already running a mix of "super
 old" systems and "rolling out two weeks" systems.
- Move gitbox.a.o to a newer Python3, Puppet v6, Ubuntu 20.04
 deployment. Lots of moving parts. The team is working to untie as
 many as possible, to make "pulling the switch" less scary.

General Activity
- One of our staffers had traveled overseas, and was then blocked from
 returning home due to the pandemic.  He was able to return in June,
 and remains healthy.
- Beginning some work on a Slack 'bot to perform various ASF functions
 for Infra, projects, and contributors. This led to a very helpful
 advancement in how we run Python services across our systems. Our
 core Python-based services (KIF, Blocky, Loggy) have been ported to
 Python 3.x and this new service-provision workflow.
- Beginning rollout of Ubuntu 20.04 (LTS) systems. One particular note
 of hope is the inclusion of Apache HTTPd's mod_md in the base
 deployment, which will assist projects with LetsEncrypt certs.
- Apache Beam has migrated their donated nodes from the shared Jenkins
 Master over to a new Infra-provided cluster, so the project can
 refine their usage/control of those donated nodes.
- Many upgrades/migrations of old Ubuntu and Puppet v3 deployments to
 our new Puppet v6 and Ubuntu 18.04/20.04 systems. This process has
 stepped up its pace over the past month, and the Infra team is
 coordinating with many projects, regarding their VMs.

17 Jun 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- looks like a solid replacement for our archaic svn-based
 payment/invoice workflow. The svn workflow grew out of our tech
 mindset, but is inappropriate for the Foundation at this point in
 our evolution. is providing needed controls, auditing, and
 archival of our payments/invoices. Operations/Treasurer is working
 through the process adjustments with Virtual.

Long Range Priorities
- Migrate away from mail-*

General Activity
- Working through some backup issues as the amount of content we're
 managing has increased over the years.
- Lots of work/completion on p3->p6 migrations.
- Crash-moved a service after it died horribly. There are a few other
 critical services in that datacenter which we will move in a
 planned fashion.
- Apache CouchDB is testing GitHub Discussion with great results.
- Continued development with
- Widespread issues due to an upstream/intermediate certificate
 expiration. New chains had to be deployed. Still working through
 this issue with a 3rd party service.
- Confluence has been upgraded.
- Supporting M&PR with an "Inside Infra" interview and annual report.

20 May 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

We have no specific highlights for the Board this month, beyond the
Finances section.

Per instructions from the President, and advisement from Treasurer,
Infra has trimmed its requested budget for FY21. Two particular
changes: holding off on backfilling an open headcount, so our Staffing
reduction reflects this delay; reducing Travel to represent only our
staff meetup, and skipping an ApacheCon North America for FY21.

Short Term Priorities
- Complete our migration to R53 for DNS management, to enhance
 reliability and management improvements.

General Activity
- Documentation balance between www.apache and infra.apache is nearly
 finished, with all infra-related docs moving. Work continues on
 clarity, updating, and expansion.
- Continued work on our email subsystems, particularly around the
 spamd servers.
- Ran a series of Infra-related payments through This
 delivered some wins and losses. Infra/Treasurer are resolving.
- Migrating many Jenkins nodes from the shared Master over to the
 project-specific Masters running under our CloudBees Operations
 Center setup.
- New .asf.yaml feature so that projects can manage their github
 notification deliveries. This identified several projects that were
 misconfigured and dev/null-ing some notifications.
- Part of the above work: new github mailer work to combine and thread
 commit emails.
- archive.a.o has migrated to a new, larger system within Hetzner for
 better bandwidth costs (and more disk space, for a fraction of the
 price of the prior system).
- pubsub improvements to monitor LDAP changes, and to source events
 from AWS SQS. We deliver all github events to SQS so that we can
 manage our servers without losing events -- when our servers come
 back up, they drain the SQS queues.
- Purchased and replaced several TLS certificates.
- Renewed effort to improve our configuration management:
 - Move our older hosts off Puppet v3, over to Puppet v6.
 - Initial testing of new automation/deployment systems.

15 Apr 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Many finance-related items: budget, pause on hiring,
- Working on a starter guide for website build and publish.

- The budget has been submitted for FY21.
- We should get back the bulk of our flight expenditures, due to
 government policies for airlines when a flight is canceled. Our
 current exposure is about US$2500, but should reach zero after
 discussion with airlines.
- We have one open headcount, but have decided to put a hold on
 backfilling the position until we get a better sense of the
 pandemic's effect on the world economy and our sponsors' ability to
 continue helping us.
- Working with Treasurer to stand up The initial work looks
 very good, and (optimistically) we can start using it for issuing
 some payments. We will continue after the Easter weekend.

Short Term Priorities
- Complete the project-specific Jenkins masters standup.
- Fix Docker security/permission support.
- Using the CI/CD statistics, trim failing builds and work with
 projects to improve their utilization.

Long Range Priorities
- Move to new payment workflows.
- Portions of our mail system have been moved to 18.04/p6, but the
 rest needs to be migrated.

General Activity
- Continue documentation work, specifically around www.a.o/dev/ pages
 which need to migrate to infra.a.o.
- TravisCI was contracted for another year, at the same rate.
- Provided support for vote.a.o for the Annual Meeting.
- Improving Docker support and security on our Jenkins slaves.
- Changes to s.a.o to increase its capacity. The Annual Meeting
 created a large load, due to many links/uses for documentation and
 processes around the meeting.
- Added "stale" ticket insight on our Jira SLA page.

Uptime Statistics
The TLP servers were having stability issues, but some parameter
tuning has stabilized their operation. It does not appear that these
created any impact on our communities or downstream users.

18 Mar 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- The Infrastructure F2F in Nashville has been canceled. Our
 accommodations have been fully refunded. It is unclear whether
 flights will be refunded; we have spent about US$3250.
- One of our staff departed the Foundation on March 13, 2020. We will
 begin the hiring process to backfill this position.
- The Board requested specific reporting upon our CI/CD services. We
 have pulled together a page to gather insights:
 We will continue to refine this page, specifically to delineate
 projects using their own build nodes vs those of shared nodes.

We submitted a revised budget to the President, incorporating a new
line item for the service which runs lists.a.o. This expense had
historically not appeared within Infra cost accounts, so we missed
carrying this forward into the FY21 budget.

Short Term Priorities
- TravisCI renewal contract.

Long Range Priorities
- Continued CI/CD capacity growth.

General Activity
- blogs.a.o was migrated to 1804/p6, in coordination with M&PR.
- Groundwork for switching to AWS Route53 for DNS management.
- CI/CD statistics gathering and reporting.
- Working with our mirror providers to move them to https: if possible,
 in light of Chrome eventually disabling downloads from http: sites.
- Continued work on new Jenkins masters for project-specific groups of
 build nodes.
- Stood up pubsub.a.o as a new service to coordinate various actions
 across the Foundation. It is intended to replace the ad-hoc systems
 in use today, under a single cover. This will incorporate git/svn
 commit notifications, website publishing, PR/issue updates from
 GitHub, and more. Future work will incorporate better resiliency,
 replay, monitoring, and insights which are missing from today's
 variant systems.

19 Feb 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Added Jekyll as a standard website builder. This is the default
 for GitHub Pages, so many people are familiar with it, and have
 been asking for it. This is provided via our ".asf.yaml" system.
- The certificate on our LDAP servers expired, causing havoc. We
 quickly generated a new cert and propagated that to the servers
 and clients needing it (including our vendor who operates the
 lists.a.o archive).
- Chris Thistlethwaite wrote a blog post for the "Success at
 Apache" series being produced by Marketing & Public Relations.

- The FY21 Infra budget has been developed and submitted to
 operations@ for inclusion into the overall budget.

Short Term Priorities
- Expand our agenda, for the team meetup.
- Get downloads.a.o running (see below)

Long Range Priorities
- Turn off hermes, and get our mail infrastructure on modern systems.
- Rackspace may be ending their Open Source support program, likely
 at the end of 2021. We have no critical services operating there.

General Activity
- JMX monitoring has been added for deeper insights into our
 Java-heavy services (Jira and Confluence).
- Decided on time/location for Infra F2F: March 29 to April 3rd,
 in Nashville, TN, USA.
- Infra documentation improvements are in-progress for three
 primary areas: www.a.o/dev, infra.a.o, and the INFRA cwiki.
- Project-specific Jenkins Masters are getting set up and tested,
 using CloudBees Operations Center for managing them.
- Several Infra people attended FOSDEM, and had good chats with
 community members and a couple of our vendors.
- The Apache Roller instance for blogs.a.o has been updated.
- Looking at "rundeck" to improve our tooling.
- The TLP servers have been failing. We are going to move the
 downloads off of www.a.o/dist/ to a new downloads.a.o server,
 as a way to stabilize the *.a.o websites.
- Two projects were identified as improperly using our primary
 servers to download distribution artifacts, contrary to our
 published policy (ie. use the mirror system). Given the problems
 that we've been seeing on the TLP servers (above), we notified
 the projects to fix their distribution points.

15 Jan 2020 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Hired Andrew Wetmore as a part-time Technical Writer/Editor to
 review, refine, and manage the 20-year pile of organically-grown
 documentation on our websites and wiki.
- mod_mbox based archives (mail-private.a.o and mail-archives.a.o)
 have been officially deprecated, in favor of lists.a.o. An internal
 announcement was made, feedback has been received, and our current
 plan is to decommission those services in mid-to-late February. The
 larger email to committers@ will occur around January 20, 2020.

- The budget process will be starting within the next month or two.
 Infrastructure does not foresee any significant changes from prior
 years, nor the five-year forecast.

Short Term Priorities
- Complete the decommission of mod_mbox, which will require some list
 backfill at lists.a.o, and archive management on mbox-vm (our
 long-term, internal mbox storage system).
- Get our team meetup in Nashville, TN, USA planned. We are shooting
 for end of March, or early April.

Long Range Priorities
- Complete migration away from p3, especially for the nodes still
 using Ubuntu 14.04 (which saw its EOL in 2019).

General Activity
- An updated automated port-blocking system ("blocky3") has been
 rolled out, replacing blocky2. This system uses log-scanning to
 detect abuse, then blocks those clients. blocky3 introduces some
 scalability features, leading to better responsiveness.
- Assisted Sam with setting up his Loomio demonstration instance,
 currently residing on debate.a.o
- Per our Targeted Sponsorship from CloudBees, we have installed the
 Operation Center to closely manage Jenkins masters. Apache CouchDB
 was the first community to use this, and is finding it useful and
 productive. We have a couple new rollouts planned for the Apache
 Beam and Apache Hadoop projects, each of which has a set of build
 slaves assigned to those communities. It is presumed that the
 CloudBees install will allow us to better tune/allocate the nodes
 for those communities.
- Stabilized svn-master.a.o after its move to p6/18.04.
- Some minor mirror system improvements, to add https: support, along
 with some documentation for better support.
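
The log-scanning idea behind blocky3 (noted above) can be illustrated
with a minimal sketch. This is not the blocky3 implementation: the log
format (client address as the first field), the threshold, and the
function name are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical threshold: hits allowed per scanned window before blocking.
THRESHOLD = 3


def find_abusers(log_lines, threshold=THRESHOLD):
    """Return the set of client addresses seen more than `threshold` times.

    Assumes (for illustration only) that each non-empty log line begins
    with the client address, followed by whitespace.
    """
    hits = Counter(line.split()[0] for line in log_lines if line.strip())
    return {client for client, count in hits.items() if count > threshold}


# Made-up sample data using documentation-reserved addresses.
sample_log = [
    "203.0.113.7 GET /dist/",
    "203.0.113.7 GET /dist/",
    "198.51.100.2 GET /index.html",
    "203.0.113.7 GET /dist/",
    "203.0.113.7 GET /dist/",
]

blocked = find_abusers(sample_log)
print(sorted(blocked))
```

A real system would also need ban expiry and an allow-list; those are
omitted here to keep the sketch short.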

Uptime Statistics
- Downtime occurred across many services on Friday, January 10th. The
 core issue was an automated update to libgnutls which invalidated a
 certificate chain on our LDAP servers. We had to create a new cert
 with a valid chain, load that onto our LDAP servers, and then
 propagate that cert to all LDAP clients (many), along with updating
 some Java certificate stores. The overall impact was minimal for
 "downstream" since authentication is not required for public access;
 the services needing authentication (eg. Jenkins) were up/down for
 various periods across a span of four hours.

18 Dec 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

- Likely to hire a Technical Writer/Editor this month

Short Term Priorities
- Complete LDAP configuration changes.
- Solve httpd stability problem.

Long Range Priorities
- Complete the work on transitioning our email archives, and the email
 delivery systems.

General Activity
- Testing CloudBees Core for some Jenkins features, notably requested
 by the Apache CouchDB community. The system may be useful for other
 communities, particularly to fence off their allocated/donated job
 nodes to that community.
- Standing up some ARM Jenkins nodes.
- Getting Pootle in line for AOO, along with investigating other
 online collaborative translation tools.
- Upgraded Confluence.
- Testing some new GitHub features for our projects' usage: Pages, and
 Actions (along with Secrets for the Actions).
- Turned off our SonarQube server, in favor of the donated services
 from SonarCloud.
- New OTP command line tool for sudo access.
- Testing automated redirects for mail archives over to lists.a.o
- Some work to ensure svn authz is functioning correctly, and to
 simplify ongoing management.

Uptime Statistics
- Our Subversion server has seen errors with "stuck" processes. We
 have seen similar behavior on our TLP website servers, and the
 Bugzilla server, so it appears to be due to the particular usage and
 configuration of httpd. We've spent a good amount of time trying
 different approaches of traffic management, configuration,
 reporting, and analysis. We are optimistic that it should return to
 stability shortly. It appears the root cause was our migration to
 the platform in early November.
- Our DNS provider was pointing at our old hidden master. We are
 investigating an alternate solution for DNS, to remove this reliance
 and problem vector.

20 Nov 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

We are engaging a third-party to assist us with translations. This
should provide projects with a translation/crowdsource tool for their
translations. We're currently evaluating, with optimism.

Short Term Priorities
- Ensure stable backups after several disk failures/swaps.

Long Range Priorities
- Finalize a service for our projects' translations needs.
- Move qmail off our archaic fbsd servers to a new Ubuntu/Puppet install.

General Activity
- Migrated/upgraded our Subversion service to a new server.
- New features in the ".asf.yaml" service; several provided by Bryan
 Ellis (an infra volunteer).
- Our translation service/VM has been incredibly difficult to upgrade.
 We are testing a new, outsourced service for performing translations.
 Apache OpenOffice is the primary user, so we are working with them
 on testing and evaluation.
- A couple disks failed on our backup server, but thankfully not at
 the same time. Our service provider is *very* fast with disk
 replacements (about 15 minutes), but then it took several days each
 time to "resilver" the RAID array. We are stable now, and will
 monitor the system over the next few weeks. During this process, we
 took a few extra steps at our cloud providers to create snapshots
 and replications "just in case". We'll unwind those additional
 precautions once we see our needed stability in the RAID array.
- Working with a potential service on our CI/CD. A new Jenkins Master
 has been stood up, and testing is progressing.
- Working on moving many qmail/ezmlm configuration files into svn for
 recovery/audit purposes.
- Nearing turn-down of our local Sonar analysis install, after moving
 many projects to the hosted SonarCloud service. Several new projects
 have signed up for the service.
- Implemented a new frontend component of our centralized blocking
 service (blocky) that explains bans to affected users and guides
 them through the unban process, should they request such. We hope
 this new process will lessen the communication load when people find
 themselves blocked on Apache services.
- Improved the infra-provided website statistics[1] available to all
 projects, including adding a human-readable YAML version of the
 various sheets of statistics.


16 Oct 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

Three major source control initiatives:
- The subversion server was upgraded to 18.04/p6 and moved into a new
 data center. The new server is much more responsive, due to faster
 disks and an upgraded svn.
- Restored the service that maps svn projects over to readonly GitHub
 projects. We dropped the unused/obsolete mirrors, and reestablished
 (28) mirrors. The mapped git repositories are no longer published on
 an server, but remain private on the new svn server.
 Users should access the git mirror/repository from GitHub.
- GitHub commit webhooks are now delivered to AWS SQS, instead of our
 server, for resiliency (eg. when we upgrade the box) and for proper
 ordering (eg. when large payloads are processed after smaller, later
 payloads). Gitbox polls the SQS queue, in order, removing payloads
 after processing.
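
The "process in order, remove after processing" pattern described for
the webhook queue can be sketched as below. This is an in-memory
stand-in, not the SQS API and not Gitbox's code; the function name,
payload names, and simulated failure are hypothetical.

```python
from collections import deque


def drain_in_order(queue, handle):
    """Process payloads oldest-first; remove each only after `handle` succeeds."""
    processed = []
    while queue:
        payload = queue[0]      # peek at the oldest payload without removing it
        try:
            handle(payload)
        except Exception:
            break               # leave it, and everything behind it, for the next poll
        queue.popleft()         # safe to remove: processing finished
        processed.append(payload)
    return processed


# Hypothetical scenario: "push-2" fails once, then succeeds on the next poll.
failures = {"push-2": 1}


def handle(payload):
    if failures.get(payload, 0) > 0:
        failures[payload] -= 1
        raise RuntimeError("simulated processing failure")


queue = deque(["push-1", "push-2", "push-3"])
first = drain_in_order(queue, handle)    # stops at push-2
second = drain_in_order(queue, handle)   # resumes from push-2
print(first, second)
```

Because a payload is removed only after it has been handled, a restart
of the consumer loses nothing and later payloads never overtake
earlier ones.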

We have a large payment this month that was deferred from FY19, but
due to cash accounting will be marked as FY20. On balance, we were
under budget in FY19, and _may_ be over in FY20. Starting in FY21,
this line item will be zeroed.

Short Term Priorities
- More 14.04 upgrades to 18.04/p6.

Long Range Priorities
- Finish deprecation of the CMS, and assisting projects to move to new
 content solutions (including possibly-outsourced solutions such as
 GitHub Pages and/or Wordpress).

General Activity
- Jira now uses LDAP for committer sign-in, and non-committers create
 independent accounts.
- Testing an easier 2FA system for Infra's use on VMs.
- Improvements to the new .asf.yaml system, including better Pelican
 support for website building.
- Exploring some new/additional options for more CI/CD. The projects
 have a boundless need, so "more" is always Good.
- Enabled TLS on our inbound MX servers. Work continues on updating
 our entire email infrastructure, from our oldest systems.
- Buildbot 2.0 is up and running, and a few jobs have been moved.

18 Sep 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

The Infrastructure team met with the community during ApacheCon NA in
Las Vegas. This provided a good chance for the team/community to meet
each other, and to address specific queries from the community.

Launched new .asf.yaml self-service features[1], with some immediate
pickup and all-around happiness.

We lost a disk at one of our data centers, affecting the
service. We have decided to deprecate much of that host's functionality,
and are currently rebuilding svn-to-github mirrors, with some limits.

Short Term Priorities
- Migrate our Ubuntu 14.04 systems up to 18.04 and puppet6.
- Complete the svn-master migration.

Long Range Priorities
- Our Ubuntu-based replacement of "hermes" (our primary MTA) is
 looking great. But we still need many months to perform an orderly
 transfer and rebuild of the service. This has been in-process for
 very many years.

General Activity
- Stood up haproxy to spread outbound SMTP across several IPs, in
 order to "warm them up". This will assist with future failovers and
 system migrations.
- New backup system has been implemented, as a default for all
 puppetized systems. The several outliers already have backups, and
 others required various levels of manual work. This new work is to
 simplify and ensure future backups-by-default.
- Documentation consolidation, and revamp.
- Continued migration to 18.04/p6. We are focusing on the oldest 14.04
 systems for migration, but will upgrade any system once it comes up.
- Major work across the Jenkins nodes, with updates and new packages.



21 Aug 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.

We experienced email delivery issues for a couple days due to a
misconfiguration on our server. That was corrected, and a variant
appeared some days later. This would not normally be an issue to
highlight to the Board, but in this case it affected our delivery to
Virtual Inc, our service provider for finances, fundraising, and
conferences. These three topics are particularly acute leading into
ACNA19 next month. Our apologies to Virtual, in specific.

Infra has coded up some special warning systems to detect similar
problems in the future, and applied additional monitoring should this
scenario reoccur under any other adverse condition.

Second: we completed the datacenter exit mentioned last month. This
occupied the team's highest priority for much of July.

Short Term Priorities
- Migrate and upgrade Jira on Sunday, August 18th (should be finished
 before the Board's meeting).
- Warm up a new IP address for outbound email. This includes an
 improved email system configuration for reliability, and to avoid
 future delivery problems noted in our Highlights.
- Restart our process to hire a Technical Writer/Editor.
- ApacheCon North America, in Las Vegas.
- Move svn-master.a.o onto a modern VM (18.04/p6).

Long Range Priorities
- Deprecate the old Content Management System (cms.apache) that is
 running on old hardware, and unmaintained software. We are exploring
 "turnkey website publishing" for the Foundation's projects.

General Activity
- Moved our primary backup server off a failing machine onto newer
 hardware with more storage, at a lower cost.
- Many migrations from our old datacenter, primarily into Hetzner. At
 the same time, we upgraded to 18.04/p6.
- Completed the migration of moin wiki content over to Confluence,
 enabling us to turn off MoinMoin and its host machine.
- Improvements in our account creation tooling, along with some LDAP-y
 goodness in our Python support library.
- Lots of work on our underlying email systems, towards our migration
 to a modern, fully-puppetized system (and away from our old FreeBSD
 setup that is difficult to maintain).
- Lifting our runbooks off reference.a.o onto cwiki, and refining them
 to modern-day actuals. ref.a.o is deprecated.
- Moving off our own SonarQube installation to SonarCloud (a
 complimentary service for F/OSS projects)
- Fixing up the fingerprint system on www.a.o/dev/machines.html

17 Jul 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues
requiring escalation to the President or the Board.


Many Foundation projects use TravisCI for their continuous build/test
system, rather than (or in addition to) our Infra-managed Jenkins and
buildbot systems. Infra has purchased additional capacity for use by
our projects, but we continue to overtax our Travis resources. This is
partly due to a typical tragedy of the commons: projects don't
understand how much they are using (and thus, denying to others). We
found just three projects were using over half of all our capacity.
Lack of APIs and detailed metrics from the Travis system have hampered
work to provide detailed feedback to our projects.

We are investigating alternatives for outsourcing build capacity. Some
suggestions on our builds@ mailing list have been provided, and a
couple targeted sponsorships are being investigated.

Short Term Priorities
- Complete our travel planning to attend ACNA 2019 in Las Vegas. The
 entire team will be attending this year.

Long Range Priorities
- Complete the migration to Ubuntu (LTS) 18.04 and Puppet 6 (P6)

General Activity
- Many migrations to 18.04/P6 have been performed. We are
 concentrating on the oldest systems running Ubuntu 14.04
- The team spent about two hours responding to a DMCA request. More
 information is in the Legal Affairs report.
- Emptying one of our datacenters, by migrating all services/VMs to
 other datacenters (primarily Hetzner).
- Certificate updates for our web sites, and several internal systems.
- Preparing moves for our Subversion server, for Jira, for our primary
 mail server, and for our mail archives. This entails lots of puppet
 work and testing, before "pulling the lever".
- Small in-place upgrade for Jira, to resolve a security issue (which
 didn't apply in our configuration, but just to be safe...).
- Automatically applied topics to all GitHub repositories to group
 them by project (h/t David Blevins), and to carry their DOAP settings
- Resolved a security issue within our Jenkins build environment.

19 Jun 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

- We are starting a second CDN trial, which is looking promising

Our FY20 budget has been submitted to the President, for rollup to the
Board. It appears in this month's agenda for review, modification,
and/or approval.

Short Term Priorities
- Finish planning, and migration to a new svn server.
- Upgrade our backup server ("bai") to newer hardware and more storage.

Long Range Priorities
- Move our mail server to 18.04/P6

General Activity
- Migration of projects' VMs to Ubuntu (LTS) 18.04 and Puppet 6 (P6)
 has begun in earnest. The P6 support has been tested and solidified,
 so we have started the coordinated effort to move VMs.
- Gitbox improvements for availability, with some future plans to
 further harden its uptime.
- Upgrades of services: Pootle, Fisheye, Jira
- Work on upgrading mail-archive.a.o and mail-private.a.o to 18.04/P6,
 along with deprecating minotaur mbox archives (in favor of mbox-vm).
- Many conversions from the deprecated MoinMoin wiki over to
 Confluence. Many projects have voluntarily offered conversion and
 coordinated their move with us. Infra has also performed many moves
 for projects that are in the Attic, or for retired podlings.
- Upgraded Crowd, an SSO solution for our Atlassian products, along
 with integrating our LDAP users' authentication.

15 May 2019 [David Nalley]

Infrastructure is operating as expected. No issues that
currently need escalation to the President or Board.

Infrastructure closed out well within budget for FY19. The proposed
budget has been submitted to the office of the President for review
at the upcoming Board meeting.

CDN Evaluation
As our communities grow, particularly beyond North
America and Western Europe, we're seeing more calls to
have greater geographic diversity in a number of our
services, simply to reduce wait time. Following our
team meeting in New Orleans, we've been working on
evaluating several different CDN options. Two projects,
both with significant Asian participation have been
assisting us in this evaluation, with their project
websites being moved to a CDN, and thus far the
feedback has been positive. Much more testing and
evaluation will happen before we arrive at a decision
on whether to proceed or in what direction.

Long Range Priorities
- Test and migrate to the new email setup ("hermes-vm2").
- Turn off mail-archives.a.o and mail-private.a.o, in favor of
 lists.a.o, leaving behind a redirect system for permalinks.

17 Apr 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Infra is within budget, as we close this fiscal year, primarily due to
the open headcount until just recently. The FY20 budget is mostly
developed, contains no surprises, and will be presented to the Board
in May, as part of Operations' overall budget.

Team Meetup
The team gathered in New Orleans on Friday, April 12th. Over the
course of Saturday, Sunday, and Monday, we completed much more work
than expected. Some highlights: launching a CDN experiment; meeting
our two new hires (Drew and John); puppetizing our auto-block system;
completing the testing, puppet configuration, and removal of technical
debt around our email system; upgrading Confluence; deep discussions on
team goals such as ticket handling and service-level expectations; and
overall bonding among the team. Going forward, we will absolutely
continue this team meetup format, and switch/diminish our ApacheCon
goals to be primarily outward-facing (instead of a work agenda).

Mail Downtime
The host running our email service failed.  If this sounds familiar,
we indeed had something similar in May of 2014.

Thanks to a lot of staff work in deconstructing, understanding,
automating, and maintaining our mail systems, downtime was more than
an order of magnitude less (~6 hours vs 5 days). While we don't
necessarily want to incur any downtime, we're very pleased with the
relative resilience we've been able to produce. Work is ongoing, to
drive MTTR for complex services another order of magnitude.
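
As a quick check of the "order of magnitude" comparison above, using
only the approximate figures quoted in this report (~6 hours now, 5
days in 2014):

```python
previous_hours = 5 * 24   # 5 days of downtime in the earlier incident
recent_hours = 6          # approximate downtime this time

ratio = previous_hours / recent_hours
print(f"Downtime was about {ratio:.0f}x shorter")
```

A ratio of roughly 20 is consistent with "more than an order of
magnitude"; a further order of magnitude would put recovery in the
range of tens of minutes.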

Long Range Priorities
- Test and migrate to the new email setup ("hermes-vm2").
- Turn off mail-archives.a.o and mail-private.a.o, in favor of
 lists.a.o, leaving behind a redirect system for permalinks.

20 Mar 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

We completed our hiring process, and brought aboard two new system
administrators: Drew Foulks and John Andrunas. They started the week
of March 18, 2019.

Short Term Priorities
- Onboard the new employees.
- Deprecation of the MoinMoin wiki, and the associated moves of
 projects over to Confluence or the GitHub wiki system.
- Evaluate hosted solutions for Reviewboard and Sonarqube.

Long Range Priorities
- Complete puppetization and migration off the old hardware, then
 begin a mass migration to Ubuntu 18.04 LTS, and Puppet v6.

General Activity
- Upgraded Confluence, and its various plugins. We have now configured
 Confluence to use standard/simplified Apache LDAP sign-in, while
 retaining the ability for the public to create accounts and
 contribute to wiki pages.
- Working with M&PR to stage a new website design.
- Our automated ban/blocking system called "Blocky" was completely
 revamped, as a result of the Eclipse fallout (described below). The
 new Blocky better handles bans, limited-time bans, whitelists, and
 limited-time whitelists. The user interface has been rebuilt to
 simplify the discovery, ban repeal, manual ban/whitelist, and other
- Jira plugins were upgraded.
- Our install of Crowd (an SSO solution for Atlassian products) has
 been upgraded, and we're rolling it out across our ecosystem.

Uptime Statistics
We experienced some downtime on our TLP servers due to a bug in the
Eclipse IDE plugin loading system. The Apache Geronimo project removed
an obsolete plugin (that was erroneously hosted or referenced to our
server), and Eclipse did not recognize the 404 response from our
server. Eclipse retried endlessly/immediately. This
appeared to coincide with some Eclipse-related event, as the removal
occurred long ago. Whatever instigated the fetch from our servers, it
ramped up quickly and arrived from thousands of source addresses all
around the planet. The team worked on various remediation efforts and
blocks, but the final answer was to simply reinstate a file that
Eclipse could retrieve. This stopped the traffic load, and our systems
have returned to normal.

20 Feb 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Short Term Priorities
- Finish planning/booking for Infra meetup in New Orleans in April.
- Make offers to new system admins.

Long Range Priorities
- Complete the hermes-vm transition. Mail and mailing lists appear to
 be working, yet more testing and edge case features need to be
 completed before we move a subset of email flow to the new system.

General Activity
- All projects have been moved to Gitbox, and the old git-wip box has
 been decommissioned. Related work with git commit emails was completed.
- The team has been interviewing our final set of candidates. Offers
 should go out within the next week or two.
- New "yellow alerts" class of warnings on PagerDuty to warn on-call
 about an issue, without a late night page.
- Continued work on hermes-vm, our mail system replacement for the
 aging hermes box at OSL.
- Several systems have been deployed with our new Ubuntu 18.04
 support, and using Puppet v6. These rollouts confirm and improve our
 ongoing work on this platform. We'll begin upgrades of existing
 puppetized systems to 1804/p6 later this year.
- Moved DNS "hidden master" off the aging minotaur.a.o box, to a new
- Per above, minotaur is mostly deprecated at this point, and general
 login has been disabled, along with old cronjobs that were running.
- CPU rate limiting has been installed on some of our sites that are
 usually subjected to scrapes/crawls.
- Bugzilla has been migrated to a beefier server. Along with the new
 rate-limiting, it should have much better uptime. This will also
 decommission about four separate VMs that were being used to run BZ.

16 Jan 2019 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Infrastructure's 5-year budget forecast was completed, and is included
in this month's President's report.

Short Term Priorities
- Hire new people.

Long Range Priorities
- Move to Puppet 6. Most of the infrastructure currently runs on v3,
 with a couple older systems still on v2. We have been lifting
 non-puppet systems to v3 over the past two years, during our
 migration away from ASF hardware into the cloud. v6 planning,
 design, and initial implementation has been in-progress for about a
 year (v5 for most of that, and v6 recently). This tooling will be
 paired with upgrades to Ubuntu LTS 18.04 (most of our systems run on
 LTS 16.04 and LTS 14.04). Rollout of v6/18.04 will begin this year.

General Activity
- Moved to Slack. The Infra team's locus of activity has relocated
 from HipChat to Slack. Our users have been invited to the new chat
 system for quick queries and requests.
- Completed migration away from zmanda backup software. All backups
 are now rsync-based.
- Jenkins and its plugins were upgraded just before New Years. One of
 the Jenkins master's drives went offline and was replaced. We had
 eight minutes of downtime for the drive replacement, and two hours
 while Jenkins started itself back up.
- Continued work on a new "hermes" (aka our core email transfer
 agent). Our test system is now running on Puppet v6.
- Various refinements of our monitoring/status/paging systems with our
 Slack setup and usage.

Git-Wip-US -> GitBox Migration
Migration from git-wip-us to gitbox was announced in November 2018,
starting with voluntary coordination between projects and infra, and
forced migration on February 7th for remaining projects.

During the current migration time frame, from November till now, around
65 projects have moved some 325 repositories across to gitbox
voluntarily, and coordinated with infra, so as to minimize downtime and
work disruption. As of January 15th, approximately 150 repositories,
covering 50 projects, remain on git-wip-us, while 1300 repositories,
covering 175 projects, are on gitbox. Thus, the split is ~90% of all
repositories on gitbox and 10% remaining on git-wip-us.
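
The split quoted above follows directly from the approximate
repository counts in this report:

```python
on_gitbox = 1300    # approximate count from the report
on_git_wip = 150    # approximate count from the report

total = on_gitbox + on_git_wip
gitbox_share = on_gitbox / total

print(f"{gitbox_share:.1%} on gitbox, {1 - gitbox_share:.1%} on git-wip-us")
```

With these inputs the shares come to about 89.7% and 10.3%, which
rounds to the ~90% / 10% split stated in the report.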

We have sent out an additional email to all remaining projects on
git-wip today (Tuesday Jan 15th), three weeks before the forced
migration, to see if we can coordinate migration with them on a
voluntary basis and in a non-disruptive way before the deadline.

On February 7th, we will conduct a mass migration of the remaining
repositories, as well as a major spring cleaning of repositories belonging
to retired podlings or attic'ed TLPs.

19 Dec 2018 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Short Term Priorities
- Move from HipChat over to Slack. We are signed into both platforms,
 but much of our day-to-day work is still on HipChat.
- Perform a Jenkins/plugins upgrade.
- Hire new people.

Long Range Priorities
- Move to new email infrastructure.

General Activity
- GitBox migration is now active. Projects have been told they "must"
 move eventually, and we're taking volunteers right now. Each
 community can vote when a move is best for them, and we will migrate
 them. When the flow of volunteers slows, we'll put out another call
 and/or suggest a deadline when we'll perform a mandatory migration.
- Reviewing a huge number of resumes.
- Continued movement to rsync-based backups, so we can avoid paying
 for a Zmanda license in January.
- The AOO wiki and forums have been migrated to new cloud hardware
 (away from our VMware hosts at OSU/OSL), newer software, and the old
 stuff has been decommissioned.

21 Nov 2018 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board. This report
covers two months, as we missed our report in October.

The Infrastructure Team has "taken over" management of,
and will be using that for our needs going forward (as Atlassian's
HipChat has been end-of-life'd). As a side-effect, we will officially
support this Slack workspace for existing channels/projects, and for
those who may join going forward. We signed up for Slack's NGO
benefits for the workspace, though it is possible that heavy usage may
move us into a range where we will be billed for some usage. We will
continue to monitor the usage, billing, and determine if any
strategies are needed to keep the Foundation's chat expenditure down.

We are operating within/ahead of budget. We are actively processing
submitted candidate resumes to fill two open headcount, which our
budget will cover once we fill the positions.

Our renewed sponsorship of cloud credits has arrived, with the
requested three year duration and a generous credit. We will be using
these credits to expand our GitBox service to all of our projects.

Short Term Priorities
- Complete our hiring process.
- Finish deprecation/replacement of the Zmanda backup service with
 rsync-based backup. (this will save on licensing costs)

Long Range Priorities
- Turn off all of our old hardware. Some projects are still on this
 hardware, along with some core Infrastructure services. This creates
 both stability and security issues for us.

General Activity
- ACNA in Montreal was a success for the team. We spent a lot more
 time together this year, than our time in Miami (May 2017). We
 accomplished a long list of tasks that took advantage of the higher
 bandwidth of face-to-face communications.
- Continued work on GitBox, in preparation for moving all projects
 over from git-wip.
- Work on replacing our email infrastructure (from an old FreeBSD
 system creaking along, to an Ubuntu-based server) continues. The old
 server lives under a decade of patches and custom work, so our
 largest hurdle is archeology. The new server is reaching a state for
 testing, and then slow migration in a month or three.
- Various work on making our monitoring, logging, and alerting
 processes more regular across our multitude of VMs.
- A number of our donated Jenkins nodes had small drives, which had
 been causing problems. The sponsor has upgraded those VMs to larger
 drives, and we've re-imaged and re-integrated them into Jenkins.
- We had a security issue with our SonarQube analysis server that has
 been resolved, along with assisting some projects to use the new setup.

17 Oct 2018 [David Nalley]

A report was expected, but not received

19 Sep 2018 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Our job opening is now posted at

Short Term Priorities
- Establish a new DNS hidden master, deprecating that function from
 our old minotaur.a.o server. This will also puppetize and trim the
 domains that our infrastructure supports. The bulk of our owned domains
 are handled by our domain provider.

Long Range Priorities
- Moving to Puppet 5, and to Ubuntu 18.04

General Activity
- Restored dSNMP monitoring, after our prior provider ended the
 service. The tool is a F/OSS script pulled from GitHub.
- Rebuilt our logging storage/query service on new machines more
 finely-tuned to the ElasticSearch installation. This expanded our
 storage capacity from a single month to over a half-dozen.
- Various Jenkins node work (Windows, disk space, etc)
- Streamline git integration/config with ASF mailing lists, jira

Uptime Statistics
We had an extended downtime (about 60 hours) after the backplane for
our Jenkins Master machine blew up. We were able to bring up a new
machine, get the build data moved over, and turned the service back
on. We picked up more storage space on the master, in the process.

15 Aug 2018 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

Infrastructure is running under budget, primarily due to our open
headcount. Due to our continuing growth, we will start the process to
get that position filled.

Short Term Priorities
- Move svn-master.a.o
- Gitbox migrations

Long Range Priorities
- Starting work on Puppet v5 and Ubuntu LTS 18.04
- Continuing work on "hermes migration" (our core email server) from
 its old box, to something maintainable

General Activity
- Our team has finalized travel to ACNA18 in Montreal, and will be
 meeting with the community, along with lots of time with teammates.
- Now running a script to trim all Jenkins slave nodes of old
 artifacts, which were especially problematic on a few small-disk
 nodes that we are running. This should reduce the rash of tickets
 that were filed about "out of disk" on the slave nodes.
- Investigating alternatives to HipChat, now that Atlassian sold the
 product to Slack. The Infrastructure team needs a sophisticated chat
 system for daily use, with a number of integration requirements. We
 are working through the options to meet our needs, with an eye for
 other uses throughout the Foundation.

18 Jul 2018 [David Nalley]

Infrastructure is operating normally, and there are no issues for the
President or the Board.

Short Term Priorities
- Jira will be upgraded to its latest release. We schedule these
 upgrades for about every six months, rather than attempt to track
 every upstream release.
- Our GitHub integration has been running into API rate limits. This
 will require some new mechanisms to reduce our calls.
- The team is going to investigate the new mod_md module for automated
 integration of LetsEncrypt certificates. This will simplify
 certificate management on project VMs and will enable us to use LE
 on the main TLP servers.

General Activity
- Jenkins was upgraded to the latest LTS release, along with upgrades
 to the large number of plugins used by our projects. To assist with
 tracking before/after versions of the plugins, we've automated much
 of the reporting through the Jenkins API.
- Apache Maven moved off of the deprecated CMS system to a simple
 svnpubsub arrangement, with generation occurring through their own
 workflow. Maven was one of the largest users of the CMS (it has
 around 1.1 million pages), so this is a big step in the long path
 towards decommissioning the CMS.
- Buildbot was upgraded, including underlying upgrades to the
 operating system and our puppet scripts. Follow-on improvements to
 buildbot script management have improved this service for our projects.
- The team has begun making plans for ApacheCon in Montreal, to meet
 with the community and for our yearly team meetup.

20 Jun 2018 [David Nalley]

* One of our sponsors has provided cloud service credits for Infra to
 use for running our services. These credits are running out at the
 end of September, but we have received assurances they will be
 continued. The Infra team has been working with Fundraising to
 secure the continued credits, hopefully with an increase and with
 public recognition under the "Targeted Sponsorship" program.
* Another sponsor has been providing cloud service credits, and Infra
 has been working with Fundraising to streamline, simplify, and
 recognize their donation.
* Taking advantage of our puppetized services, the team has been able
 to easily shift services around to reduce our overall costs. In
 particular, Europe has very low bandwidth charges, so we have moved
 heavy users (rsync, archive, TLP) to European data centers.

Short Term Priorities
- The Apache Subversion server that we run has been having some
 difficulties related to backups. The server runs in the United
 States, and backs up to Europe. We plan to move this server to
 Europe, which will have second order effects on costs, some
 downstream services, and reliability.
- Our Jenkins build environment is due for an upgrade to the next LTS
 release, as is Jira. We will be upgrading both of these systems
 over the next month.
- In order to drop our open issue count, the Infra team is planning a
 bug bash to occur on Monday, June 18th (two days before the Board

Long Range Priorities
- Moving our email infrastructure from old FreeBSD-based systems into
 newer, puppetized services has been an ongoing, difficult, and
 long-term goal. Some good progress has been made on this front over
 the past month, but we have lots of investigation, building, and
 testing before enabling the new setup of this critical service.

General Activity
- Confluence was upgraded to 6.9.0
- Continued work on our new status and monitoring changes
- We had some failing disks on our mailing system, which took several
 days to sort through. Additional disks were ordered, and we have two
 hot spares installed now. This should hold us until our long-term
 work enables our migration.
- We've had a project, Fineract, request a project VM with a resource
 allocation an order of magnitude higher than what we typically
 provide. We've told them that this request exceeds our ability to
 provide the service, and redirected them to the Board to seek
 funding for such a service.

16 May 2018 [David Nalley]

No issues for the President/EVP, nor for the Board.

Infrastructure ended up at a slight 0.3% overage on our $800K+
budget. We will call this a success. Leading into FY19, we will
continue our ongoing efforts to minimize costs and to provide as much
value as possible from the budget provided to our department.

Short Term Priorities
- Upgrade Confluence to the latest release
- Switch to the newly-provisioned Jenkins Master
- Finish tooling to enable easy/mass migration from git-wip to GitBox

Long Range Priorities
- We are currently running on Puppet v3, aka Old. Moving to v5 will
 take some considerable effort, and will be started this year. Quite
 a bit of research has already been done on the impact to our
 operation, and on how we can run v3/v5 side-by-side as we perform
 the migration. This is lower priority than decommissioning old
 hardware and moving those systems into puppet.

General Activity
- Our page has been moved to This
 involved a move from system monitoring via PingMyBox to NodePing.
 This has simplified some of our status/monitoring flow, and brought
 some new features (particularly around scheduling maintenance, and
 paths for people to sign up for notifications). PMB was performing
 some traffic routing checks which are not available in our new
 system, but that is acceptable.
- The old mail-search.a.o server has been moved to mail-search-old,
 and our new replacement is mail-private.a.o. It has gone live, and
 has been working terrifically (long term, it will disappear in favor
 of lists.a.o)
- We have brought up a new "TLP Server" within Azure, and are running
 tests to ensure everything is working correctly. It is in DNS round
 robin, so is serving live traffic at this time. Concluding this
 effort will allow us to turn off some very old hardware.
- In the vein of turning off old hardware, we have also started some
 work on puppetizing our core email system ("hermes"). This will be a
 long and fearsome process to transition into the cloud, but we'll
 get there one day.
- We had an unscheduled upgrade of FishEye after receiving
 notification of vulnerabilities in our version.
- Our rsync server for mirrors has been set up on a single-purpose
 bare metal box (it used to be shared with TLP servers, and fought
 for I/O). We will shortly disable the older servers, in favor of
 this single machine, in order to reduce bandwidth costs and reduce
 the I/O contention with other ASF services.

Uptime Statistics
We've been having difficulties with our Jenkins Master server, and
have needed to reboot it several times over the past week. The root
cause is still unknown. We have chosen not to dig in just yet, as
we've acquired a beefy new server tuned to become the new master. The
old one had I/O contention with fellow VMs on the box; this new one
will operate on bare metal with wicked-fast I/O. Post-move, we'll dig
in should similar problems arise.

18 Apr 2018 [David Nalley]

No issues to raise to the President or the Board at this time.

Based on "actuals" through March, Infrastructure is slightly over
budget (about 0.5%). These actuals have been used to prepare the FY19
proposed Infra budget attached to the Board's April agenda.

Short Term Priorities
- One of our third party services provides us with service monitoring
 and reporting. It will be shutting down in a few months, so we have
 begun to look for replacement service(s).
- We are looking at standing up a new Jenkins "master" node, which
 should improve the responsiveness/throughput of our Jenkins CI
 cluster.

General Activity
It has been a relatively quiet month, though we dealt with an
apparently email-based DoS, a two hour datacenter downtime, and a
couple disk failures on our remaining hardware (which demonstrates our
desire to continue migrating to cloud-hosted providers).

21 Mar 2018 [David Nalley]

Infrastructure is operating as expected, and has no current issues to
bring to the attention of the President or the Board.

We have entered "budget season". Infrastructure has submitted its
first draft to the President for inclusion into a draft budget. That
draft will be presented to Members during our Annual Members Meeting
(starting March 20th), and later to the Board that will be elected
during that Meeting.

For Fiscal Year 18 (May 2017 to April 2018), the Infrastructure team
has managed its outlays and should end the year on-budget. We've
adjusted our FY19 request based on past-year actuals, and expected
growth in outlays corresponding to the Foundation's growth.

Short Term Priorities
* We have picked up some responsibilities for running Apache STeVe as
 the voter tool, used by the Foundation, for its Annual Meeting
* The feature for "searching of private ASF archives" is being
 removed, in favor of a longer-term solution focused around However, still resides on
 old hardware and needs to be migrated before our expected timeframe
 for validating "full production" of lists.a.o. The browsing of
 private archives is in-process, and should be deployed shortly to

General Activity
* As part of our migration plan, has been
 moved to a brand new VM (rather than ASF-owned metal). The site has
 been reviewed, verified, and switched over to the new VM. This is
 just one part of our overall migration process, and upgrading the
 many systems involved in our mail handling, but a very necessary
 part that is now behind us.
* We have been continuing work on ASF users being able to "self serve"
 their needs. Over the past month, we've improved the ability for
 Incubator Members to request new repositories, mailing lists, and
 other resources for their podlings. Another facility was
 added to assist with recovering from problems in website updates.
* Using some of the Travis CI APIs, and working with their support
 team, we have been able to analyze our usage of running builds,
 their concurrency, and queueing of builds. This will help us to
 review our capacity planning, and determine growth needs around our
 projects' use of Travis.

21 Feb 2018 [David Nalley]


Infrastructure continues to operate normally.

There is an open question for the Board to consider, regarding high
per-project expenditures. More below.


Infrastructure is on target for FY18, and is ready for the upcoming
budget process for FY19.

Short Term Priorities

- Begin work on a TLP server operating from within Azure.
- Investigate shared service usage by the projects, to look for and
 correct outliers consuming too many resources. In particular, we are
 looking at buildbot, Jenkins, and Travis.

Long Range Priorities

- Tooling continues for GitBox bulk moves, in order to support moving
 all ASF projects from git-wip over to GitBox during 2018.

General Activity
- Jenkins and Bugzilla were both upgraded on short notice, due to
 announced security vulnerabilities.
- LDAP server simplification has been finished.
- We spun up, and migrated to a new ElasticSearch cluster, which is
 used for log recording, auto-banning for overuse, and log
 searching/perusal. Our old cluster had underpowered disks and could
 not keep up with our needs.
- We operate a couple "email relays" for moving internally-generated
 email out to the world, and to our mailing lists. The old relay was
 moved off private hardware, to a pair of cloud-based VMs. This is
 part of our long-term work to improve, puppetize, and harden the
 email services provided by the Foundation.

Project Expenditures

As part of our work to move away from ASF-owned hardware to
cloud-based services, we originally expected our team to migrate two
VMs associated with a single project which handle their wiki and their
forums. We placed a pause on that work, noting that these are
project-specific VMs, whereas Infrastructure normally provisions *one*
per project and expects the project to maintain (which includes
migration when necessary). An explanation, background, and concern were
raised to the VP Infrastructure and to the President about this topic.

After some discussion was held, it appears there is a question that
needs Board feedback:

If the Board wants to allow additional spend for specific projects,
specifically using Infrastructure costs beyond its budget, then what
does the Board require to make such a decision?

Historically, the Board has generally stated that projects and their
communities should be self-sufficient and receive no additional
Foundation funds to support them beyond their internal ability. If the
Board maintains this policy, then the Infrastructure team will work
with the project to develop a transition plan for these VMs and their
future maintenance by the PMC. Should the Board decide that it would
like additional Infrastructure budget assigned to one or more specific
projects, then it would seem a proposal, discussion, and decision
process would be desired. It would be expected these costs would be
called out during the budget process, as an opportunity cost against
Infrastructure's normal operational budget.

17 Jan 2018 [David Nalley]

Infrastructure continues to operate normally, with no issues to raise
to the President or the Board.

Short Term Priorities
* Move more content onto infra.a.o
* Move our DNS hidden master off of minotaur.a.o, as part of our
 long-term program to deprecate minotaur.
* Prepare for moving to Atlassian's "Stride" chat system, as HipChat
 is being deprecated. Initial testing has already begun, but we do
 not (yet) have a hard date on when they will cut us over.

Long Range Priorities
* Continue deprecating old hardware and operating systems, moving onto
 VMs and machines at third-party hosting services

General Activity
Jira underwent a major upgrade on January 14. We are now on the
almost-latest version of Jira (they released a new version last week).
This allows for some better integrations with our LDAP system and
other Atlassian software that we run.

We have continued to make improvements to the GitBox system, including
a more responsive list of repositories. We've been working on tooling
to perform "mass migration" of projects from git-wip over to gitbox,
with the goal of deprecating the Foundation's operation and management
of git repositories. We will retain a mirror/archive of projects'
repositories on GitHub (to meet Legal Affairs' requirements), but
these are much simpler to manage, being readonly and low traffic.

Our work on simplifying the LDAP system has progressed nicely. There
are some remaining items related to older operating systems, and some
external partners that we will be wrapping up. This has reduced the
number of LDAP-related incidents that we used to see, interfering with
operations of our infrastructure.

The holiday season also gave us a great opportunity to upgrade Jenkins
to its next LTS release, and to make numerous improvements to its
speed and reliability. As Apache Maven has reported, we've worked with
them to improve job management and web site generation.

Uptime Statistics
We continue to maintain a high state of uptime, meeting our goals.

20 Dec 2017 [David Nalley]

Infrastructure is operating normally, and believes there are no issues
for the President or the Board.

The Infra budget is the largest at the ASF, and we are holding to our
FY18 goals, so all is fine here. As noted last month, the FY18 year
budget was incorrectly forecasted (though we're holding to it). That
has been corrected in the five year budget that we presented to the

We've cleaned out a lot of stale virtual machines and volumes in AWS,
saving us some good money (we're currently spending against credits
that Amazon provided to us last year).

Short Term Priorities
* Get our team familiar with Azure and start using some of the credit
 donated by Microsoft.
* Wrap up our LDAP server(s) simplification plan.
* Upgrade Jira.

Long Range Priorities
* Deprecate git-wip and move all projects onto gitbox (GitHub as master)

General Activity
Oath/Yahoo donated a bunch more Jenkins build nodes. We've also
updated the connection mechanism so that externally-run nodes (such as
those run by the MXnet podling) could more easily connect to our
Jenkins Master.

We've had some ongoing work to replace mail-archive.a.o and
mail-search.a.o until we get lists.a.o certified (we still need to
lay out our requirements for that).

15 Nov 2017 [David Nalley]

Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Much work has been done around finances this past month, but we remain
on track to meet our overall budget.

With the help of Tom from Virtual, we investigated an oddity in our
staffing expenditures. Given that we currently have an open headcount,
our year-to-date costs should be low, but they were running at our
budget levels instead. The answer is that Infra incorrectly forecasted
FY18 costs from the FY17 actuals. We left the headcount open for a
while, and along with savings in other cost centers, when we fill the
headcount for the remainder of FY18, the total Infra costs should fall
within its total budget.

One of our secondary (large) budget items has been variable, due to
currency fluctuations. The structure of the payments and related
sponsorship is due to change soon, and should create a more stable
cash flow. The Treasurer and Fundraising teams are working on this.

Short Term Priorities
- Working out plans to move some services into Azure, under their
 programs to assist F/OSS organizations.
- Digging into our VM usage across our various datacenters to find the
 top-line costs, for future tradeoffs and cost-savings.

Long Range Priorities
- Continue VM migration and puppetization. Our major concerns here
 revolve around mail handling.

General Activity
We have consolidated our domain management under Namecheap to reduce
expenses, make use of their API, and get better support. We plan to
expand/shift our DNS management to Namecheap as well, away from our
custom solutions.

Uptime Statistics
We had an outage on mail-archives.a.o that took us a while to correct.
It highlighted our ongoing issues with services on old hardware, old
operating systems, and non-puppetized services. We corrected the
issue, but have started on additional disaster-relief measures. Given
the age of these services, monitoring is difficult.

18 Oct 2017 [David Nalley]

Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Following on from last month, Infrastructure's finances are in good
shape, and running under budget on many of our cost centers. However,
we still have one open headcount with an unknown impact. We will
continue to manage against the budget.

Short Term Priorities
- Schedule Jira and Confluence upgrades
- Complete our LDAP system upgrade from master/master/many-slaves
 to a simpler master/few-slaves layout

General Activity
- Major Jenkins upgrade to the latest LTS release was completed.
- Rebuilt JIRA database, in preparation for its next major upgrade.
 The upgrade date is still unscheduled, as we need to do some
 additional test runs.

Uptime Statistics
No significant, unscheduled downtime occurred during the past
reporting month. Our systems continue to run with high availability.

20 Sep 2017 [David Nalley]

Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

There appear to be some strange numbers in this month's financials,
so some further analysis will be done over the next couple weeks. This
is likely a simple matter of misapplied cost centers. Will provide an
update next month.

Short Term Priorities
- Upgrade Confluence, Jira, and Jenkins. We will schedule these for
 weekends, as several hours of downtime will be required for each.

Long Range Priorities
- First priority: shift towards puppet-managed cloud-based services.
- Various automation, which likely will involve working with the
 Whimsy folks to loop this into their product.

General Activity
- Seven machines de-racked at OSU/OSL, as part of our mission to move
 from ASF-owned hardware to cloud-based infrastructure
- Several disks at OSL were reporting a predictive failure, so we
 ordered a few more and are working with OSL to get them installed
 and looped back into the RAID system. Even though these systems are
 on a path to decommission, a disk failure would have been much more
 expensive than the cost of the replacement.
- Towards our goal to decommission minotaur, mbox-vm has been created
 to function as our official archive of all mail content/archives.
 Minotaur is serving this function today, with mail-search.a.o and
 mail-archive.a.o slaving their content from minotaur. mbox-vm is
 still a work in progress.

Uptime Statistics
The Infra team has continued its excellent uptime record. We had no
significant downtime on any system this past month.

16 Aug 2017 [David Nalley]

No Board-level issues at this time.

The InfraAdmin has been working with our Accounting team to clear up a
workflow and invoice issue with one of our vendors. Our vendor will
now send invoices directly to Infra for review and approval.

Operations Action Items
We have a service contract with Symantec for signing binary releases.
That contract expired in June, and we are working to re-up signings
for the next year. This has been hampered because we need to
"authenticate" our business according to new rules from the CA/Browser
Forum. We are also switching the primary contact point to the
InfraAdmin and our VP Infra email address, to provide backstop points
on contract/renewal issues. The authentication process is involved,
but is progressing. With some hope, it will be solved by end of
August.  This service is primarily used by our Apache OpenMeetings and
Apache Tomcat PMCs.

Per above, our D&B records reflect our new Wakefield postal address,
but many other third-party records are out of date. These need to be
updated to ease verification of our business.

Short Term Priorities
- Complete our LDAP schema changes, and server layout.
 - The schema changes have been performed in conjunction with the
   Apache Whimsy development team to provide tooling, and design
   thoughts on our updated schema. The goal is to provide a unified
   view of Incubator podlings and TLPs.
 - Our LDAP servers are configured as a multiple-master system with
   multiple slave replicas. We will be winding this down to a simpler
   single-master and a few replicas layout.
- Expanding gitbox usage to more projects.
- Improved/automated monitoring via DataDog.

Long Range Priorities
- Moving all projects over to gitbox. We are still discovering some
 edge cases in our tooling, so the mass-migration is not "now".
- Upgrades to our Confluence and Jira installations, along with moving
 them to use Atlassian's Crowd product for single-sign-on.
- Revamp of our DNS management; see below.

General Activity
Over the past few weeks, we've done a lot of work with the domains
that we manage. As domains are coming up for renewal, we've been
moving them to Namecheap. However, this process will likely accelerate
as Namecheap provides API-based facilities that can help our DNS
management. After some testing, it appears that we can defer/diminish
our DNS work by shifting those functions over to Namecheap.

Lots of work has been done to ensure that our historical/archival
mailing list content is correct. Thanks to Sebb and Gavin for the
grunt work to make this happen. In addition, the team is now reviewing
what remains to declare as the only service for
archive access (and turning off mail-search.a.o and mail-archive.a.o).

19 Jul 2017 [David Nalley]

Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Operations Action Items
- Continue discussions with the Whimsy community on Foundation
 workflows, and how Infra can support their work (more below)

Short Term Priorities
- Fill the open headcount. The past few months have been an evaluation
 of the team with five FTEs, and it is (now) clear that our workload
 demands all six allocated/budgeted positions to be filled.

Long Range Priorities
- We continue to reduce technical debt by puppetizing services (which
 reduces manual configuration) and moving to cloud-based hosts and
 VMs (reducing reliance on our own hardware, and offering flexibility)
- Automation of Infra tasks is on our long-range planning, but has
 been placed at a lower priority relative to puppetizing services and
 migration to cloud-based VMs. However, the Whimsy community has
 engaged with the Infra community to add various tooling/services to
 their tool. Several workflow improvements have occurred, with more
 planned and/or waiting to be rolled out.

Jenkins Upgrade
Jenkins was migrated to a new VM and VM host on Saturday, July 15.  At
the same time, Jenkins was upgraded to the latest 2.60.1. Work had
been done in preparation: a new jenkins_asf Puppet Module and Yaml
were written in readiness for this, and a builds-test was soaked in,
which allowed tweaks to the puppet config.

A fair amount of downtime (over 10 hours) was due to a final rsync of
data after turning off Jenkins, the installation of upgraded plugins
before the overall upgrade, and another round of plugin upgrades after
the upgrade (for those plugins not compatible until the new version
was installed). All nodes also had to be updated to use a JDK1.8
connection. This has consequences for those projects that use Maven
type jobs and want to build with JDK1.7 and earlier, but this has been
discussed and workarounds noted.

The mailing was notified many weeks ago about the
upcoming upgrade and migration, and again shortly before the upgrade
happened. During the upgrade, Twitter, builds@ and operations@ were
also kept in the loop every few hours until completion.

At this stage, builds.a.o is operating smoothly with just a few
non-infra owned nodes needing to reconnect. We may bump the VM memory
after a few days of stats collecting, but we'll see.

In terms of technical debt, this allows us to retire physical hardware
(crius) which is located at OSU/OSL. Our new Jenkins Puppet Module
also gives us much more flexibility to move, restore, and otherwise
manage the Jenkins system.

21 Jun 2017 [David Nalley]

Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

The Infra team was able to meet last month during and after the
ApacheCon in Miami. This was a chance for the entire team to get
together to discuss work and for team bonding.

Short Term Priorities
- Discuss and possibly perform an additional Confluence upgrade to a
 more recent version
- LDAP changes to better support project membership management, and to
 support Atlassian Crowd (a single sign-on system for their products)

Long Range Priorities
- Continue retiring ASF-owned hardware, reduce technical debt, and
 update documentation/runbooks

Uptime Statistics
- Confluence was taken down to perform an emergency upgrade after we
 learned of a critical CVE in the version that we are running.
- Jira was taken down for about two hours to perform a planned upgrade
- We maintained our SLA requirements, even with the upgrades

17 May 2017 [David Nalley]

Infra is operating normally. Our main focus continues on retiring
technical debt.

Short Term Priorities
- Hire new person to fill the open position
- Team meetup at ACNA 2017 in Miami

Long Range Priorities
- Finish puppetizing all services
- Decommission all ASF-owned hardware at OSU/OSL

General Activities
Migration from OSU hosted VMs is still progressing. We've graduated 2 new
TLPs in the last two weeks (CarbonData, Fineract).

Uptime Statistics
Overall, uptime met our SLA. We had some issues with web crawlers
abusing our services, which led to us having to block an entire /16
block from Digital Ocean. We are working with them to sort this out.

Our service (run on eos at OSU) had a hard drive
failure and was down for a few hours till we had the disk replaced. The
service is now up and running again.

Community Growth
This period, we've had one new contributor to our puppet repository,
as well as contributions from 9 people who contribute on a regular
basis. On the JIRA side of things, we had 23 new people interacting
with infra via JIRA, making it 23 new users, 102 regulars (people that
have contributed before and do so often), as well as 8 'returnees'
(people that have been absent for >2 years but are now contributing/
reporting again).

On a 3 month view, we've had code contributions from 19 people, of
which 13 were regular contributors and 9 were new. 73 people have worked
on or reported a JIRA ticket for the first time, while the other 184 who
worked on or reported issues had done so before.

Cost-per-Project Reduction
Work has been done on estimating the cost-per-project for
Infrastructure. See for details.
This is still at an early stage, and thus the figures are not
entirely accurate yet. We are working towards better managing
the documentation on time spent per project/component.

19 Apr 2017 [David Nalley]

Infra is operating normally, although we are sad to see one of our
teammates depart. Our main focus continues on retiring technical debt.

March actuals show us running about $11k over for the FY17 year. The
FY18 budget is based on recent actuals, plus increments that were not
budgeted in prior years. In particular, costs related to staffing and
increased cloud costs due to shifting away from ASF-owned hardware in
the Open Source Labs at OSU. We continue to balance our cloud leasing
primarily between LeaseWeb, AWS, Online.Net, Hetzner, and some smaller
footprints elsewhere. Each provider has individual benefits for the
type of service that we are running.

Short Term Priorities
- Hire new person to fill the open position
- Team meetup at ACNA 2017 in Miami

Long Range Priorities
- Finish puppetizing all services
- Decommission all ASF-owned hardware at OSU/OSL

Uptime Statistics
Our server running the Jenkins master (crius) experienced a hard drive
failure in its RAID array. We lost no data, but the server did become
non-responsive while the array needed to be examined. "Hands" at OSL
replaced the drive with a spare that we had on-site, and a couple days
later the RAID array was back in full operation (the array is under
heavy I/O load, so the rebuild took a surprising amount of time).

We are now looking into moving the Jenkins master to cloud hardware
sooner rather than later. The hardware failure is a perfect example of
why we want to move away from our own hardware.

15 Mar 2017 [David Nalley]

No Board or Executive action is requested at this time. Infrastructure
is operating normally, without issue.

The Infrastructure team is working within the FY17 budget set by the
Board in December 2016. Planning for Fiscal Year 2018 has begun, based
on the five-year budget outlook prepared earlier.

General Activity
The Infra team has begun booking travel and hotel for ApacheCon in
Miami, in May. We will be using the conference for team education, for
interaction with the community and volunteers, and for team building
and face-to-face meetings.

GitHub as Master ("Gitbox")
Our work on enabling GitHub as the primary focal point for development
continues. As we add new projects, we've found/added several improvements
to our recording of provenance.

Cost-per-Project Reduction
Our work on reducing our per-project costs continues to be a second
priority, relative to our primary work of VM/machine migrations and
ramping-up of gitbox.

27 Feb 2017 [David Nalley]

Recent Issues
Nothing is needed from the President, or the Board. These are reported
as an FYI only.

We have seen issues regarding SHA-1 vulnerabilities, supporting the
Apache Maven project, and changes in our build systems. Details are
provided below, in the "General Activity" section.

We spent $525 to purchase several months of online training. This is
an unbudgeted amount, but we believe its unlimited coursework for the
entire team, for three months, was worth the experiment. At the end of
the period, we will evaluate whether an extension is warranted. Costs
for ongoing staff education will be included in our next budget.

Short Term Priorities
- Decommission ASF-owned hardware at an accelerated rate, in favor of
 cloud-provided servers
- Continue ramping up the Gitbox service
- Balance our datacenter usage for cost efficiency
- Gitbox/Jira integration

Long Range Priorities
- As reported before: continued migrations of our legacy servers and
 services into new puppet-based services that we can efficiently
 deploy to cost-effective cloud providers.
- Automation to reduce the incremental cost of regular Infra tasks
- Migration from Puppet V3 to $nextgen system for providing services

General Activity
We set up a new web area for the Directors to create an authoritative
set of pages for Board-approved policies and commentary. Some initial
work by Directors has populated some data/pages, but the site design
and content is still in its infancy. The plumbing appears to work, so
we're "done" and will follow with continued support.

There has been a lot of Internet discussion about Google and CWI
finding and publishing a SHA-1 collision, and their statement that it
is now possible to construct additional collisions. They will be
releasing further data in a few months.

From our initial analysis, this issue only affects our Subversion
services as a limited denial of service, instituted by an Apache
committer (NOT by a third-party attacker). The Apache Subversion
community has been discussing and analyzing the issue, including the
extent of the problem and appropriate mitigations. We have already
deployed a script developed by their community, to prevent a committer
(or a compromised account) from pushing either of the collision documents
into our repository.

We have confirmed that our website certificates DO NOT use the SHA-1
algorithm. This has been in place for quite a while.

This past month, we discovered that we cannot support the growth of
Apache Maven's private copy of the Maven Central repository. We
previously offered the PMC a VM to keep a copy (should M.C go dark,
we'd retain all necessary data for the ecosystem), along with CPU to
perform analysis against that copy, but looking at the storage growth
rate, we determined that this offering was not sustainable within the
current Infrastructure budget. We notified the Apache Maven PMC that
we needed to retract the offering, and for them to seek specific
budget support from the Board for their needs.

Over the past year, the Infrastructure Team has moved to a policy of
"Ubuntu Only" for our machines, to lower our costs. In the past, we
had a lot of time to support multiple operating systems, services, and
customized software deployments. With the rapid growth of the ASF, and
the resulting demand for Infra support, we have pulled back on the
edge cases to focus more strongly on ROI of our staff's work
effort. That has resulted in the Ubuntu policy, which then resulted in
the decommissioning of Solaris, Mac OS, and FreeBSD build slaves in
our buildbot service. Needless to say, that has raised concern within
several projects who relied on the availability of those platforms.
We have made it clear that the Infra Team will integrate third-party
custom build slaves into our system, so that projects can use those
slaves for their non-Ubuntu builds.

Uptime Statistics
Overall uptime reached 'three nines' with 99.9% uptime. The 'worst
offenders' this period were writeable git repositories (due to a TLS
bug) and Jenkins, though none of the critical or core services went
below 99%.
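
As a rough illustration of what these percentages mean in practice, the
sketch below converts an uptime figure into the downtime it permits. The
30-day period and the function name are assumptions made for this
example; they are not taken from Infra's reporting tooling.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime permitted by an uptime percentage.

    period_days is an assumed reporting period, not a figure from the report.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.0):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes down per 30 days")
```

Under that assumption, 'three nines' leaves roughly 43 minutes of
downtime in a 30-day period, while 99% leaves a little over seven hours.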

For more information, please see

Community Growth
This period, we've had one new contributor to our puppet repository,
Jitendra Pandey, as well as contributions from 9 people who contribute
on a regular basis. On the JIRA side of things, we had 29 new people
interacting with infra via JIRA, making it 29 new users, 93 regulars
(people that have contributed before and do so often), as well as 5
'returnees' (people that have been absent for >2 years but are now
contributing/reporting again).

On a 3 month view, we've now had code contributions from 22 people, of
which 13 were regular contributors and 9 were new. 94 people have worked
on or reported a JIRA ticket for the first time, while the other 187 who
worked on or reported issues had done so before.

GitHub as Master ("Gitbox")
Our Git services are planned to land on gitbox.a.o, so we generally
refer to this as "gitbox". In the past month, we started moving the
OpenWhisk podling over to the gitbox service. That has been a very
slow move, so Tika and Nutch have recently been added to gitbox. It is
too soon to remark on problems and SLA for these communities using
GitHub as their master/primary focal point of development. These
communities have enough activity to help us surface and pinpoint
problems. We've made several improvements based on feedback, and still
need to implement some Jira integrations.

We will likely add more projects before the next Board meeting, and
will report on such additions.

Cost-per-Project Reduction
An important, needed clarification arose this past month, regarding
the definition of this effort. Reducing the cost-per-project is about
managing the marginal/incremental cost each time a project is
introduced to the ASF infrastructure. This effort is not about the
*overall* Infrastructure bottom line (eg. staffing, training, travel
costs) as those costs have *very* tenuous connections to the
incremental cost of a new project.

There is certainly a mild connection to hardware/hosting costs, as we
offer VM services to projects. Those VMs create a very real cost to
the ASF, and we are in-process on a way to track and allocate those costs.

The more direct costs appear to be related to the work that
Infrastructure performs when a podling is accepted, and when a podling
graduates. These events create a lot of work around managing mailing
lists, repositories, Jira, wikis, etc. These are the incremental costs
that we hope to reduce, through automation, once we are done with the
higher-priority work of VM migrations.

18 Jan 2017 [David Nalley]

Finances & Operations
We've been working with Virtual to deal with a process improvement
that results in better privacy protections for HR-related data
for our contractors and employees.

Short Term Priorities
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our
 supported services (eg. Sonar and Roller)

Long Range Priorities
- Move all services off ASF-owned hardware, including the difficult
 process of moving our email infrastructure
- Finish the use of puppet for all services, then explore upgrading
 to Puppet 4, and/or containers for ongoing service

General Activity
- Tightened/streamlined git sync processes (see paragraph below)
- Ongoing conversation and development plans for integrating podling
 management into LDAP (and other ASF tooling; particularly, gitbox)
- Finalizing launching Fisheye services locally at the
 ASF to replace the third-party service run by Atlassian.
- Restructured the git repository request service (
 to better handle podlings, in particular assist them in setting the
 right name and notification lists.
- We suffered a catastrophic hardware failure on the physical machine
 that was hosting the application side of Jira. The service was
 fully-puppetized and relocated to a VPS. More details will be
 published after a post-mortem, during the week of the 16th.

Uptime Statistics
Uptime for this month has been around 99.6% overall. Some issues with
git-wip running out of memory at times have pushed the uptime for this
down to around 98%. We are investigating the issue.
moved to a new host, which also caused a small amount of downtime.
Jira going down hard has not helped.

For more details, please visit:

GitHub as Master
The GitBox project is pending responses from the pilot projects before
it can continue. We are looking at multiple potential candidates at the
moment for this. Rather than wait on external groups, we will be using
Infrastructure's own website repository for our testing.

Git mirror/GitHub sync process
The sync process from Subversion and writeable git repositories to the
git mirror and onwards to GitHub has been suffering from missed
syncs lately, at an approximate rate of 1 miss out of every 10 hits. The
process has been improved and the logging also widened, so we can better
analyze any failures that may occur. The upgrades appear to have cut
down on the missed syncs by a large factor. We have, as of this writing,
had 2 missed syncs compared to the 100 or so we usually have, and both
seem to be attributed to timeouts pushing to github. The error rate
going from git-wip to the pubsub system has been reduced from around
10-20 errors per day to 0 by refactoring the pubsub agent. We continue
to monitor the situation and address any bugs that may show up.

Community Growth 2016
Since this is a new year, it might be worth looking back at 2016:
- We had 34 people contributing to our codebase (puppet) for the first
 time in their involvement with the ASF, compared to 12 people who
 regularly contribute to the repo.
- 354 new people have filed an issue with infra for the first time,
 while 329 people, who regularly work on issues, have also been
 contributing to the 2,321 issues created in 2016.
- 9 people who were previously active on JIRA have now started
 contributing code (patches etc) to Infra.

'Costs per project' Project
Unfortunately, I've been a bit swamped with other things
and this hasn't gotten many cycles. Expect more on this
issue next month.

21 Dec 2016 [David Nalley]

This past month has seen lots of activity in reviewing the FY17
"actuals" and projecting our overall FY17 expenditures, within the
Infrastructure section of the budget. These numbers have been
coordinated with the President and with Virtual, and are presented
elsewhere in the December Board agenda.

In short, Infrastructure is forecasting increases in staffing costs,
cloud services, and a small amount related to a once/year gathering of
the team at ApacheCon. On the other side, we're looking at lower
hardware costs as we transition from ASF-owned machines towards a more
flexible posture using virtual private servers in our cloud providers.

Short Term Priorities
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our
 supported services (eg. Sonar and Roller)

Long Range Priorities
- Move all services off ASF-owned hardware, including the difficult
 process of moving our email infrastructure
- Finish the use of puppet for all services, then explore upgrading
 to Puppet 4, and/or containers for ongoing service

General Activity
- Finalizing the (emergency) move of the moin wiki
- Continued gitbox work (see separate section)
- Ongoing conversation and development plans for integrating podling
 management into LDAP (and other ASF tooling; particularly, gitbox)
- SonarQube puppetized, LDAP-enabled, and brought up. This was a big
 VM move, and will enable self-service of sonar jobs
- Continued work on puppetizing Roller (blogs.a.o) and moving that
 VM. Testing is now beginning.
- Continued puppet work and VM moves
- Beginning investigation of launching Fisheye services locally at the
 ASF to replace the third-party service run by Atlassian. Numerous
 projects use the service, but it will be shut down in early
 January. We're costing out a local replacement.
- The security-vm is in testing, with a new Jira instance. This Jira
 will be used (privately) for the Security Team and Brand Mgmt.

Uptime Statistics
Uptime for this reporting cycle hit 99.9% overall, with critical
services staying at an impressive 100.0%. The usual culprits, Jenkins
and SonarQube were responsible for dragging down the overall score,
and are being replaced/updated to address this. The moin wiki, which has
been moved and cleaned up heavily has improved immensely, going from a
previous average of 94% uptime to a solid 100% uptime over the past few
months. This weekend (Dec 17-18) we were hit by an outage at NERO,
causing outages for the services hosted at OSUOSL. While the situation
has been resolved, this will reflect negatively on next month's SLA.

For more details, please visit:

GitHub as Master
The GitBox project is moving ahead as planned, and is generally
considered ready for testing with willing projects. We are in talks with
a specific project for the initial tests, and will discuss onboarding
more projects as this progresses. The services involved have been set up
and fully puppetized, and tests are showing good results here. While
this service depends on LDAP (and thus awaits the upcoming LDAP changes
for podlings), we have modified the system to work with a hardcoded list
of podling members, so we can test with podlings without having to wait
for the LDAP changes.

We have a list of things either to be done or that have been done, at: - this outlines what we were thinking,
where we are, and what remains to be done before we can consider this
service production ready.

We invite everyone to visit and have a look,
perhaps even try out the account linking service and provide feedback on

'Costs per project' Project
As a long-term project, Infrastructure has been tasked with reducing
the "cost per project added into our systems." There has been some
work at the margins of our self-service tools and processes, to reduce
the staff/volunteer time to provide service and to reduce resource
costs of these services (eg. locating services on more cost-effective
providers). However, we have not made any progress on computing
current per-project costs, projecting those costs, or planning
mitigation strategies.

16 Nov 2016 [David Nalley]

Additions to Infrastructure
Sebastian Bazley (sebb, apmail karma)
Freddy Barboza (fbarboza)
Chris Thistlethwaite (christ)
Christofer Dutz (cdutz)

In addition to the above karma grants, our two new staff members (Freddy
Barboza and Chris Thistlethwaite) are ramping up. Expect forthcoming blog
posts from our three recent hires on the infra blog.

Operations Action Items

Short Term Priorities
- Ensuring we have gathered enough information to begin the Gitbox
- Making sure the new wiki instance is running smoothly
- Introducing new staffers to the infrastructure
- Networking/F2F meetings at ApacheCon Europe
- Further work on rebranding our new web site

Long Range Priorities
- Stand up a service for mirroring repositories and events from GitHub
- Mailing list system switchover is expected to happen in the coming
 months. We are aware of a few outstanding mail-search requests that we
 have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
- Deprecate eos (currently only running mail-archives, wiki was moved)
- Work on weaving podlings into LDAP
- Further explore identity management proposals

General Activity
- moin wiki ( was moved to a new, faster box and
 inactive accounts were pruned to greatly increase responsiveness.
- More package consolidation and updates on the Jenkins platform
- Fixed issues with the buildbot configuration scanner not working
 as intended.
- Stood up buildbot and jenkins setup for publishing web sites via git.
- Moved more VMs from vCenter to new cloud locations.
- gitbox (stage two of our github experiment) has been stood up as an
 actual machine, with some services working already. We expect to be
 able to start mirroring and gathering push logs in the coming month.
- In conjunction with gitbox, reworked the overall design of our
 writeable git repository web interface (and received positive
- Reworked policy for new git repositories so that new repositories are
 automatically mirrored and have github integrations enabled by
- Worked on a new web site for infrastructure. Some components can be
 used for projects wishing to switch to a git-based automated workflow.
- Added and fixed a bunch of jenkins build nodes.
- Debugged and (hopefully??) fixed some issues with our pubsub systems
- Consolidation, general sanity checks and harmonization of Jenkins

Ticket Response and Resolution Targets
 Stats for the current reporting cycle can be found at
 The tentative goal of having 90% of all tickets fully resolved in time
 is still being used. We are hoping that the onboarding of new staff
 will help greatly improve these stats.

Uptime Statistics
 Uptime this month was mostly affected by the moin wiki, which we have
 now moved. We expect next month to have a much better uptime than this
 month (99.79%)

 For detailed statistics, see

GitHub as Master
Sam asked us to report on this process. To date, we've finished our
discussion with Legal Affairs to ascertain the framework that things
have to operate in.

Additionally, we've started discussions on private@ about the ASF-side
of that work. You can see the documentation from that discussion:

We have an instance running in AWS (the "gitbox") to run the above.

Finally, we've identified an initial project to subject to the

'Costs per project' Project
Sam has asked to focus on looking at ways to reduce the straight
linear expansion of costs for each additional project that we add.
To date, this remains a bit of a thought experiment on how to
measure the costs in a more granular way than we do now. Currently
the data that we have shows a correlation between increase in
requests for service, bandwidth, etc. This correlation has led to
our existing planning/staffing models, but there remains much to be
done on this front.

The goal for next month is to figure out what specific points
we need to be measuring, and ideally looking backwards to validate
what the actual rate per project is in terms of consumption.

The goal for Q1 is to break that down, and figure out what the
constraints of the current onboarding and ongoing operations are.

19 Oct 2016 [David Nalley]

Operations Action Items:

Short Term Priorities:
- Work on experiment with ProxMox for virtualization to aid
 in moving VMs from the vCenter cluster, slated to be decommissioned.
- Further explore upgrading puppetised machines to Ubuntu 16.04
- Engage with and onboard new staff once hired.

Long Range Priorities:
- Explore moving podlings to separate LDAP entries, which would also
 make the MATT/GitHub experiment much easier in the long run.
- Mailing list system switchover is expected to happen in the coming
 months. A couple of outstanding tickets are pending, and we have yet
 to design a working redirect, although this is expected to be trivial.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
- Moving JIRA to a new location, as the old machine is slated to be

General Activity:
- Infrastructure will be present at ApacheCon Seville to interact with the
 wider ASF community.
- Maven backup repository is being moved to a different DC to save cost
- LDAP clusters are being resized to accommodate DC IP shortages
- VPN scenarios being worked on to reduce the number of public IPs used

Ticket Response and Resolution Targets:
 Stats for the current reporting cycle can be found at

 The tentative goal of having 90% of all tickets fully resolved in time
 is currently not being met, but we expect the rate to go up once new
 staff has been onboarded.

Uptime Statistics:
 Uptime stats have been very stable over the past few months,
 at 99.9% for critical services and 99.8% in total.

 For detailed statistics, see

21 Sep 2016 [David Nalley]

A report was expected, but not received

17 Aug 2016 [David Nalley]

Short (and tardy) report this month.


We saw the blog post announcing the open position get reposted to
several remote working job boards and have had a decent response in
terms of resumes showing up. We made a first pass over approximately 1/3 of
the incoming resumes as of this writing and hope to finish that first pass by
end of week. We've seen a number of promising resumes show up and hope to be
able to move forward with them. Hopefully we'll have material updates to
provide to this in the coming week.

General activity

Ongoing work on Jenkins and Buildbot slaves has finally coalesced into being
completely managed via puppet. This is a huge milestone as these are
relatively complex configurations and because of the number of machines it
involves. It's also gotten some critical acclaim[1]. This should make it much
easier for folks to get additional build dependencies installed in a more
self-service manner (simple pull request against the repo compared to filing a
ticket and having that done by infra)

Work is in preparation for migrating our Jira instance off of our existing
hardware. The current hardware is approximately 6 years old, and our instance
has grown so much that the underlying database infrastructure on separate
machines is starting to become a constraint.

Building on blocky (infrastructure-wide IP bans for violation of rules
against one or many hosts), we've added a dashboard to manage both the rules
and blocked IPs.

Github experiment

Nothing material to report. Things seem to be largely 'just working'.
We are working on adding an incubator project to the github experiment - we
expect this will push the limits of the project as there are literally
thousands of incubator committers, so it should be an interesting threshold to
pass and gauge our ability to cover them.


20 Jul 2016 [David Nalley]

As reported last month, we began, and successfully completed, new
contract negotiations with the contractors this month.

We are still suffering staff shortage and that continues to deleteriously
affect many things.

Infrastructure has seen some notable problems. An intermittent Jira outage
affected a large number of users. The folks at Atlassian assisted us in
diagnosing the problem and moving forward.

Additionally one of our VMware hosts has an ailing storage array. While work
was underway to evacuate all of the hosts, this storage malaise has
exacerbated the situation.

15 Jun 2016 [David Nalley]

Operations Action Items:
We've published a job description and are working with the President
and Virtual to publish this widely in efforts to solicit more candidates.

Demand for Infra services

The number of projects that the Foundation is responsible for
continues to grow, and that is placing an ever increasing burden
on demands for infrastructure resources. Today, our largest constraint
is staff members to do the work.

Historically, we've had an average of 33 tickets per month per full
time staff member, and as that average grows we typically add staff.
Today we have 3.5 full time staff members working, though statistically we
really should be at 5 just to be able to handle the ticket load. (This does not
include any time to focus on larger scale projects.)

Given our current rate of growth, addressing tickets alone will require 5.5 full time
staff members by year's end, and if we continue at our current pace we'll need to add
1 to 1.5 staff members every 2 years just to deal with continued growth.
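
The staffing arithmetic above can be sketched as follows. Only the
figure of 33 tickets per month per full-time staff member comes from
this report; the monthly ticket volumes (165 and 180) and the rounding
up to the nearest half position are assumptions chosen for illustration.

```python
import math

TICKETS_PER_STAFF_PER_MONTH = 33  # historical average quoted in the report


def staff_needed(monthly_tickets: float,
                 per_staff: float = TICKETS_PER_STAFF_PER_MONTH) -> float:
    """Full-time staff needed for a ticket volume, rounded up to the nearest half.

    Rounding to half positions is an assumption made for this sketch.
    """
    return math.ceil(2 * monthly_tickets / per_staff) / 2


if __name__ == "__main__":
    # Hypothetical monthly ticket volumes, not figures from the report.
    for volume in (165, 180):
        print(f"{volume} tickets/month -> {staff_needed(volume)} staff")
```

With these hypothetical volumes the sketch yields 5 and 5.5 staff
members, matching the scale of the estimates given above.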

Take a look at a graph demonstrating the growth rates, demand for services
based on tickets, and staff members:

Short Term Priorities:
Some of the automation behind the mechanics of transforming a
graduated podling into a TLP fell into disrepair over the past few
months. This led to many exaggerated timelines for TLP graduation.
Infra held a TLP work day and while it largely remains a manual operation
there is now a runbook that is current for dealing with graduations. And
all of the pending graduations were processed. There is ongoing work
to automate large swaths of that, but for now it should only take
~30 minutes to process a newly-graduated TLP.

Long Range Priorities:

Because of staffing shortage, precious little work has occurred on
our long range priorities. Instead we've moved back to what can
largely be described as firefighting and attempting to keep up
with incoming work as best as can be managed.

General Activity:


We suffered a somewhat longterm outage of the Nexus repository.
We temporarily restored the service, but a service move off of the
ailing VMW infrastructure is planned for the short term future.

More recently, we suffered both a database and VMware-related
outage on the same day. Our VMware infrastructure is on increasingly
brittle hardware. We have been making concerted efforts for some months on moving
VMs off of this infrastructure, and continue to do so.

New Contracts
Following up on discussions that occurred at ApacheCon NA, we are beginning the process
of renegotiating contracts for our non-employee staff members.

Uptime Statistics:
Overall we met the service uptime, though on an
individual service basis we did not meet the uptime expectations for repository.a.o.
Please see:

18 May 2016 [David Nalley]

Operations Action Items:


Our staff is down by roughly 36% currently. That has impacted SLAs,
particularly those around response times to tickets.
Staff members have been interviewing a potential new hire.

Short Term Priorities:

Code signing

After initial plans to discontinue this service due to high cost,
we were able to successfully negotiate a satisfactory contract
with Symantec.

Jira Spam

Our Jira service was again under attack this month, starting on Tuesday
morning and continuing through to Thursday lunchtime. We have
had to yet again restrict the 'jira-users' group from creating and
commenting on issues. Regular contributors/committers to projects are
still able to create and comment on tickets for projects in which they
are named in 'roles'. Most committers can not yet create INFRA tickets or
tickets for other projects in which they are not in a role.

The Infra team have banned over 60 IP addresses via fail2ban triggers.
Nearly 2000 Spam tickets were created over multiple projects. Over 160
user accounts were either deleted or disabled.

We have in place automatic ban triggers for more than one account
created using the same IP address in an hour.
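
A minimal sketch of such a trigger is shown below. It is not the actual
fail2ban configuration used by Infra; the class name, the in-memory
storage and the assumption that events arrive in time order are all
hypothetical, and only the rule itself (more than one account from the
same IP address within an hour) comes from this report.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "in an hour"
MAX_ACCOUNTS = 1       # more than one account per IP triggers a ban


class SignupMonitor:
    """Tracks account creations per IP address (hypothetical helper)."""

    def __init__(self):
        self._signups = defaultdict(deque)  # ip -> timestamps of recent signups

    def record(self, ip: str, timestamp: float) -> bool:
        """Record an account creation; return True if the IP should be banned.

        Assumes timestamps (in seconds) are passed in non-decreasing order.
        """
        recent = self._signups[ip]
        # Drop signups that have fallen out of the one-hour window.
        while recent and timestamp - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        recent.append(timestamp)
        return len(recent) > MAX_ACCOUNTS


if __name__ == "__main__":
    monitor = SignupMonitor()
    print(monitor.record("203.0.113.7", 0))    # first account from this IP
    print(monitor.record("203.0.113.7", 600))  # second account ten minutes later
```

In this sketch the second call reports that the address should be
banned, since two accounts were created from it within the same hour.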

At this moment the restrictions are still in place whilst Infra works
out future options.

Modules created for crowd testing (in progress)

Following a successful (quiet) month of having both Whimsy and Traffic
Server using the git-dual system, Infrastructure is contemplating adding
more projects to the experiment, in part to test out different aspects
of the experiment that may not be fully utilized by the current
projects, and in part to increase the load on the service to see if
anything breaks when we start hitting rate limits. Infra is at present
considering adding the Beam incubator project to the test, which would
let us experiment with extremely large groups of people (universal
commit bit etc).
At ApacheCon - we unveiled the new service -
this service is built to replace both, and See more details below.

Long Range Priorities:


No material updates here due to staffing issues.


No material updates here due to staffing issues.

Technical Debt
We've taken the first major step in retiring mail-search and
mail-archives. The former is particularly important in our paydown of
technical debt, as it currently runs on hardware that is 8 years old and
its retirement reduces the number of operating systems that we are
forced to manage. To boot, much of the mail-search platform is
undocumented and the current Infra staff have little knowledge on how
to operate those services.

General Activity:
Staffing issues have made this month a bit more mechanical than normal.

Uptime Statistics:

For detailed statistics, see

20 Apr 2016 [David Nalley]

Short Term Priorities:
- Ensuring/monitoring that the dual git repository setup works as
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain ( We believe this separation will
 also help projects and volunteers work on their ideas, such as the
 Syncope trial, as they can then develop and test without interfering
 with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the
 troubled BugZilla instances and possibly JIRA.

Long Range Priorities:
- Mailing list system switchover is expected to happen in the coming
 months. We are aware of a few outstanding mail-search requests that we
 have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
- Further explore identity management proposals

General Activity:
- git-dual has been fully puppetized and is ready for more extensive
- Traffic Server has joined the git-dual experiment, so far without
- Due to severe abuse, we had to take our European archive server
 offline and make extensive restrictions on our US archive. We also had
 to put the US archive in maintenance mode for a few days in order to
 relocate it to a new machine that can better handle the load (and
 isn't 5 years old).
- Following concerns raised on infrastructure@ about the loss of history
 due to the migration of people.a.o to a new host not including the
 contents of committers' public_html directories, an infrastructure
 volunteer stepped forward to automate the copying of this data. Of the
 original ~500GB, around 80% was inappropriate (RCs, Maven repos,
 nightly builds) and was filtered out. There have been no further
 concerns raised since the copy of the remaining ~110GB was completed.

Ticket Response and Resolution Targets:
 Stats for the current reporting cycle can be found at
 The tentative goal of having 90% of all tickets fully resolved in time
 is still being used. Compared to March, we have had slightly more
 tickets opened, and the percentage of tickets that hit our SLA is
 lower, as we are essentially 2 full time staffers fewer than we were in

 Quick April reporting cycle summary:
   - 230 tickets opened
   - 228 tickets resolved
   - 248 tickets applied towards our SLA
   - Average time till first response: 13 hours (up from 6h in Feb-Mar)
   - Average time till resolution: 44 hours (up from 23h in Feb-Mar)
   - Tickets fully handled in time: 90% (204/227)
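
The "handled in time" figure can be reproduced with a short
calculation. The counts 204 and 227 and the 90% goal come from this
report; the function name is an assumption made for this sketch.

```python
SLA_TARGET_PERCENT = 90  # tentative goal stated in the report


def sla_percentage(handled_in_time: int, counted: int) -> float:
    """Share of tickets fully handled in time, as a percentage."""
    return 100 * handled_in_time / counted


if __name__ == "__main__":
    pct = sla_percentage(204, 227)
    print(f"{pct:.1f}% handled in time (goal: {SLA_TARGET_PERCENT}%)")
```

The unrounded value is just under 90%, which rounds to the 90% shown
in the summary.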

Uptime Statistics:
 Uptime is not pretty this month, but this is primarily an aesthetic
 issue. The biggest failure here has been keeping LDAP servers in sync,
 as we have a specific LDAP server in PNAP that keeps going 10 minutes
 out of sync with the rest. We are looking into why this happens, but
 so far we have not been able to determine the true cause. Nonetheless,
 this does not mean LDAP has been down or unresponsive, merely out of
 sync on one node.

 As mentioned earlier, we also had to pull the EU archive due to
 massive abuse, primarily from some EC2 instances and other external
 datacenters. We are working with the data centers in question to
 resolve the issue, and have also imposed a new 5GB daily download
 limit per IP on the new archive machine. Preliminary data suggests
 this new limit has reduced the traffic from the archive by as
 much as 65% (going from around 3TB/day to 1.1TB/day), further
 suggesting that the bulk of traffic from that service is to poorly
 configured VMs or other CI systems that should never have been using
 the archive in the first place. As the machine that hosted the archive
 also hosts the moin wiki and mail archives, we believe that these
 services will perform better now that archives.a.o has moved.
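
A per-IP daily download limit of this kind can be sketched as below.
This is not the mechanism deployed on the archive machine; the class
name, the in-memory counters and the reading of 5GB as 5 * 1024**3
bytes are assumptions made for illustration.

```python
from collections import defaultdict

DAILY_LIMIT_BYTES = 5 * 1024**3  # 5GB per IP per day (binary GB is an assumption)


class DailyQuota:
    """Counts bytes served per IP address and day (hypothetical helper)."""

    def __init__(self, limit: int = DAILY_LIMIT_BYTES):
        self.limit = limit
        self._used = defaultdict(int)  # (ip, day) -> bytes served so far

    def allow(self, ip: str, day: str, size: int) -> bool:
        """Return True and count the download if it fits in that day's quota."""
        key = (ip, day)
        if self._used[key] + size > self.limit:
            return False
        self._used[key] += size
        return True


if __name__ == "__main__":
    quota = DailyQuota()
    gb = 1024**3
    print(quota.allow("198.51.100.4", "2016-04-20", 4 * gb))  # within the limit
    print(quota.allow("198.51.100.4", "2016-04-20", 2 * gb))  # would exceed 5GB
    print(quota.allow("198.51.100.4", "2016-04-21", 2 * gb))  # new day, new quota
```

Keying the counter on both the address and the day means the quota
resets without any cleanup step, at the cost of old entries
accumulating in memory in this simplified form.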

 For detailed statistics, see

16 Mar 2016 [David Nalley]

New Karma:


Operations Action Items:

Short Term Priorities:
- Ensuring/monitoring that the dual git repository setup works as intended.
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain ( We believe this separation will also help
 projects and volunteers work on their ideas, such as the Syncope trial, as
 they can then develop and test without interfering with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the troubled
 BugZilla instances and possibly JIRA.

Long Range Priorities:
- Mailing list system switchover is expected to happen in the coming months. We
 are aware of a few outstanding mail-search requests that we have concluded
 could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals
- Further explore the MATT experiment (see General Activity)

General Activity:
- Finished the initial design of, intended for distributed
 git repositories (Whimsy experiment). Things have been fully in sync for now.
 Some discoveries and gotchas were made in setting this up, such as split-brain
 issues and canonical source requirements for the setup to sync properly - in
 particular, it seems that the 'origin' setting in each repository must be set
 to GitHub for it to sync properly both ways. There is still the issue of a
 discrepancy between emails for commits pushed to ASF and commits pushed to
 GitHub, but this is being worked on.
- Adding new members to our GitHub organisation has been fully automated and
 tied to LDAP. Any committer setting their githubUsername field through will automatically be added as a member of our organisation
 there. We are pleased to say this is/was the final step in fully automating
 the MATT experiment from the committers' side of things. All adding/removing
 of members for github repos is now automated completely, if delayed by a few
 hours due to rate limits (in anticipation of 1000+ repositories, we have
 decided to do slow updates).
- Added JIRA SLA guidelines (as well as a status page on, see
- We had a bad disk on coeus, our central database server, causing many services
 to be slow while we replaced the disk and resilvered the mirror over the

Ticket Response and Resolution Targets:
 We have added a new SLA for tickets. We are still tweaking the parameters in
 this SLA, but the preliminary ones have been applied to our new JIRA SLA page,
 where stats for the current reporting cycle can be found at
 Tickets created on or after February 23rd are counted towards our new SLA. We
 have a tentative goal of having at least 90% of all tickets fully handled
 (responded to and resolved) in time.

 Quick March reporting cycle summary:
   - 178 tickets opened
   - 216 tickets resolved
   - 144 tickets applied towards our SLA (Feb 23rd onwards)
   - Average time till first response: 7 hours
   - Average time till resolution: 17 hours
   - Tickets fully handled in time: 95% (118/124)
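
The "fully handled in time" figure can be reproduced from the two counts in the summary. This is only a sketch of the arithmetic; the function names are made up for the example and the actual SLA page may compute it differently.

```python
def handled_in_time_percent(in_time, counted):
    """Share of counted tickets that were responded to and resolved in time."""
    if counted <= 0:
        raise ValueError("counted must be positive")
    return in_time * 100 / counted


def meets_goal(in_time, counted, goal_percent=90):
    """True if the share handled in time reaches the tentative 90% goal."""
    return handled_in_time_percent(in_time, counted) >= goal_percent
```

For this cycle, handled_in_time_percent(118, 124) is about 95.2%, which is above the tentative 90% goal.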

Uptime Statistics:
 Nothing out of the ordinary to report here (99.9% across the board). We
 decommissioned the use of minotaur as a web space provider and have moved to This has caused a slight rise in uptime
 for standard services as the old was frequently experiencing

 BugZilla has seen less abuse than before, possibly because of the measures
 taken previously. We are however exploring utilizing Apache Traffic Server to
 further ensure its future stability.

 Once again, the Moin wiki is, as always, experiencing hiccups, caused by the
 general flaw in its design. We urge any project still using this to switch to
 cwiki, if they experience slowness in the service.

 For detailed statistics, see

17 Feb 2016 [David Nalley]

New Karma:

Operations Action Items:

Short Term Priorities:
- Setting up a new git repository server at the ASF for testing git r/w
 synchronization viability.
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain ( We believe this separation will also help
 projects and volunteers work on their ideas, such as the Syncope trial, as
 they can then develop and test without interfering with production systems
- Agreement with Apple has been executed, and projects may now publish
  apps in the iTunes store. Huge thanks to Mark Thomas who has been
  working on this for multiple years and seen it to conclusion.
- Appveyor CI is now available for projects who make use of a Github
  mirror, see for more details:

Long Range Priorities:
- Mailing list system switchover is expected to happen in the coming months. We
 are aware of a few outstanding mail-search requests that we have concluded
 could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals

General Activity:
- Moved the STeVe VM to a new data center in anticipation of the upcoming annual
 members meeting. Benchmarks show it could serve an election with more than
 6,000 voters, so it should be adequate for the upcoming election.
- The new machine for serving up ASF's end of the ASF->GitHub r/w repositories
 is under work, but has been slowed a bit by some difficulty in integrating
 new hooks as well as difficulty with our current environment setup (see SRPs
 above).  As stated in the SRPs, we have discussed deploying a separate test
 environment that is isolated from our production environment and would allow
 for faster development/testing. The project hinges on three phases: ACL,
 email/webhooks and dual r/w repositories with sync. The ACL has been
 automated now and is working. The email phase has turned up some faults in
 GitHub's way of determining whether a commit is unique (and thus deserves a
 diff email), and thus we are working towards replacing this with a new
 mechanism. This new method however is dependent on getting the 2xr/w repos
 up and running, which in turn depends on rewriting the git hooks we have in
 place for our current setup, and is the cause of the current slow progress.
 We expect to have a basic work flow with new hooks in the coming week, and
 once that is confirmed, we'll attach the Whimsy code to that process.
- Work continues on moving machines from our old VM boxes to our new provider.
- Fixed some issues with buildbot configurations not being applied.
- Lots of JIRA activity: 175 tickets created, 146 resolved
- Fixed spam filtering cluster so that it runs checks correctly and is able to
 catch more legitimate spam. This has already proven very effective.
- 5 machines and two arrays decommissioned and removed from OSUOSL racks.

Uptime Statistics:
The new Whimsy VM has implemented a status page that we have tied into our
monitoring system (PMB). This is still not considered a production environment,
and thus does not count towards our SLA.

This month was a slightly bumpy ride for some services. While the critical
services continued their nice trend of pretty much 100% uptime, some core
services such as have acted very unstable, indicating that we
may need to speed up the decommissioning of this service.

BugZilla has been abused by some external services trying to scrape
everything, causing some downtime. We have implemented a connection rate
limiting module (using mod_lua) for the BZ proxy and this seems to have helped,
albeit not stopped the downtime completely.
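
The general idea behind per-client rate limiting can be sketched as follows. This is not the mod_lua module mentioned above, whose code is not part of this report; it is a hypothetical sliding-window limiter in Python, with the class name and parameters chosen for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip, now):
        """Return True and record the request if the client is under its limit."""
        window = self.hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True
```

A scraper that opens connections faster than the configured rate is refused until its older requests age out of the window, while clients below the limit are unaffected.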

The Moin wiki is, as always, experiencing hiccups, caused by the general flaw
in its design. We urge any project still using this to switch to cwiki, if they
experience slowness in the service.

For detailed statistics, see

20 Jan 2016 [David Nalley]

Things were quieter this month than normal, largely due to holidays.

General Activity:
- Added new nodes to our LDAP cluster for more robustness, starting work on
 adding LDAP load balancers to prevent abusing specific nodes.
- Work continues on moving machines from our old VM boxes to our new provider,
 as well as formalizing existing setups (config mgmt etc).
- GitHub organisational changes were applied, in order to continue with the
 exploratory MATT project. Committers can now be automatically added and/or
 removed from GitHub teams depending on LDAP affiliation and MFA settings.
- 120 JIRA tickets created, 156 resolved

Uptime Statistics:
Continuing the positive trend since November, we are very pleased to report that
uptime across all SLA segments for this reporting cycle was above the famous
'three nines', and the overall uptime - when using the same number of decimal
places as most service providers - was at a marvelous 100.0%. That is not to say
we did not have our share of service glitches - in fact we had quite a few - but
the duration of these (a few minutes each) was too insignificant to budge the
total uptime figure.

In order to compare better with places using various decimal place settings, we
have tweaked our SLA page to accept this setting as an argument, thus: will show uptime as XX.Y% whereas will show it as XX.YYY% etc.
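
The effect of the decimal-place setting can be shown with a small sketch. The function and its `decimals` parameter are hypothetical names for this example and do not reflect how the SLA page is implemented; the sample value 99.96 is illustrative, not a reported figure.

```python
def format_uptime(percent, decimals=1):
    """Render an uptime percentage with a chosen number of decimal places."""
    return f"{percent:.{decimals}f}%"


sample = 99.96  # illustrative value only
one_decimal = format_uptime(sample, 1)
three_decimals = format_uptime(sample, 3)
```

Here one_decimal is "100.0%" while three_decimals is "99.960%", which is why an uptime just short of perfect can still display as 100.0% at one decimal place.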

There are still places that we cannot or do not yet monitor, more specifically
various components in whimsy - mostly due to access restrictions -, and we are
in talks with people from the Whimsy project about utilizing a new status page
for our monitoring.

Whimsy Github Experiment
We've done some work around MATT (Merge All The Things - and are relatively happy with the workflow around
definitively identifying Github and Apache accounts and merging them.
(Our process allows for folks to sign into their Github account and the Apache
account and confirm the other.)

That work has been followed up with some automation work so that:
We can create groups on the fly, automagically populate them from
group membership in LDAP (predicated upon the person identifying their
Github ID with MATT), and then grant them commit access if they have MFA
(Multi-factor authentication) enabled. We've decided, at least for the time
being to mandate MFA - as we have no visibility into authn failures or other
auth attacks, and the MFA ability grants us better security than we have on
our own hosted repositories. (Currently ~1/2 of Whimsy committers do not
have commit access to the repository because they have not enabled MFA.)
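
The membership rule described above (in the LDAP group, GitHub ID identified, MFA enabled) can be sketched as a pure function. This is an illustration under assumed data shapes, not the automation actually in use: the function names and the plain sets and dicts standing in for LDAP and GitHub data are hypothetical.

```python
def team_members(ldap_group, github_ids, mfa_enabled):
    """Return GitHub logins that should have commit access.

    ldap_group:  set of ASF user IDs in the project's LDAP group
    github_ids:  dict mapping ASF user ID -> GitHub login (only for linked accounts)
    mfa_enabled: set of GitHub logins known to have MFA turned on
    """
    members = set()
    for uid in ldap_group:
        login = github_ids.get(uid)
        # Require both a linked GitHub account and MFA.
        if login is not None and login in mfa_enabled:
            members.add(login)
    return members


def membership_changes(current, desired):
    """Return (to_add, to_remove) to move a team from `current` to `desired`."""
    return desired - current, current - desired
```

Computing the desired set first and then diffing it against the current team keeps the add and remove steps symmetric, so a committer who disables MFA or leaves the LDAP group is removed by the same pass that adds new members.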

We still have work to do on the Github experiment, mainly around the
automation of pushes back to an ASF copy of the repositories.
We should have more to report on this front next month.

16 Dec 2015 [David Nalley]

New Karma:
None this month


Operations Action Items:

Short Term Priorities:
- The M.A.T.T (Merge All the Things) project for Whimsy is progressing, albeit
 hindered a bit by the need to change our entire setup on GitHub. We expect
 this to be solved within the next reporting period.
- We discovered an error in one of the monitoring systems we use (a faulty
 configuration) which had been preventing it from notifying us of some issues
 such as a bad disk. This has been rectified and hardware replacement should
 happen before the board meeting.
- We are in the middle of moving a lot of VMs away from our old vcenter setup to
 the new infra provider, LeaseWeb, as well as upgrading them, which will clear
 out a lot of technical debt.

Long Range Priorities:
The mailing system switchover is scheduled to happen within the coming months,
with tests being performed at the moment to port lists from the old ezmlm system
to the new mailman 3 system. We are preparing to stress-/scale-test the new

We are looking into replacing our current backup plans with more affordable ones
while still retaining a hardware-managed solution. This will hopefully cut our
backup costs in half in the long run.

General Activity:
* Lack of responsiveness
 We've noted that our responsiveness to email on infrastructure@ has suffered
 of late, resulting in dropped work items and failing to meet expectations.
 While we certainly don't want to manage work via email (far too much of it)
 it's clear that a number of issues were being lost in the sea of email. We've
 reinforced that the responsibility of ensuring we are responsive to emails
 falls to the on-call staff member of the week, and hope to see this situation

* moving to
 There have been some discussions going on about the decision to decommission
 the current server and move to a new machine. Most
 discussions have revolved around specific methods of access (rsync, scp, sftp)
 and some people have asked why the old data was not copied over verbatim. We
 are looking into whether RSYNC and SCP can become a reality (so far, our
 search has shown it's not an easy task to couple those with LDAP). We are not
 actively looking into copying everything from minotaur (the old host), as it
 would fill up most of the disk space with unnecessary files (very low signal
 to noise ratio).

* code signing service
 Discussion has begun about discontinuing the code signing service. Despite a large
 number of projects requesting the service, to date, only two projects are making use
 of the service. Moreover, in the past year, we've had a grand total of 34 signing events.
 The conversation on whether offering a code signing service remained pragmatic with
 such low uptake began on infrastructure@ - Symantec was notified that we plan to
 discontinue the service, and has asked to be allowed to submit a less expensive option
 for our consideration.

Uptime Statistics:
The November-December period has been extremely smooth sailing with uptime
across the board reaching the fabled 'three nines' (>=99.90%) and uptime for
critical services nearly hitting 100%. For more in-depth detail, please see

18 Nov 2015 [David Nalley]

Operations Action Items:
Infrastructure discovered that many projects were using git in an
innovative manner. The majority of these uses bypassed the
normal expectations that we had set about protected branches and
tags. As a temporary measure, to get us back to the same level of
assurance as expected, we disabled the ability to delete branches
and tags. This has caused a bit of murmuring as it is disruptive to the
way that many projects use git. Infrastructure awaits guidance
about policy around VCS, history, etc.

We've added a new cloud provider, Leaseweb, that should provide
some additional capacity in the US and Europe.

Short Term Priorities:
We've had an increased focus on tickets in the past month.
Tickets are being created faster than we can deal with them
when viewed from a monthly perspective. Hopefully as we are
adding additional capacity we'll address this and get it back
under control.

Long Range Priorities:
Automation hasn't been at the top of the list this month.
Nonetheless we've made some gains in this arena. We've added
a good deal of the generic build slave and buildbot master
configuration to puppet.
We've also puppetized the new home.a.o service.

The past few months have given us a good opportunity to prove out our backups
and ability to respawn infrastructure. We've gone through multiple moves of
critical systems, most notably SVN; proving that our backups work as intended
and that configuration management works well for those hosts.

Technical Debt
Web space for committers (currently hosted on is being
moved to a new home, aptly named, in the continued effort to
phase out the minotaur server. Committers will be given 3 months to move their
contents, after which will stop serving personal content and
redirect to We have opted for asking people to individually
move their content due to the sheer amount of potentially unwanted old files
that currently reside on minotaur (taking up 2TB of space). This will be a web
hosting server only, shell access will not be allowed. We plan to set up a VM
for PMC Chairs later on, for performing LDAP operations, but the free-for-all
approach that minotaur has is unlikely to be offered in the future, due to the
unmaintainability of it.  Once this is in place, the only other major
component we need to remove from minotaur is our DNS service, and we can then
retire the aging machine.  We will be publicizing the above very broadly in
the coming weeks as we start the countdown clock.

Monitoring / Logging
We have had to retire our initial unified logging cluster due to the
incredibly high amount of logs coming in (estimated 20 billion entries per
year), which was choking on disk read/write speeds, first and foremost. A new
5-node cluster with faster SSD disks has been put in as a replacement,
requiring less than 24 hours to set up and put into production thanks to our
configuration management systems and some snappy work done by the team. This
new cluster is henceforth known as 'Snappy'. We will most likely be cutting
down retention time to 3-4 months as a precautionary measure, so as to only
store somewhere around 5-6 billion records at any given time. As we mainly
need "current" logs (within the last month) for our work, this is an
acceptable compromise for us, considering the alternative would be more than
doubling the cost of the cluster.
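
The retention estimate follows from the yearly ingest figure. The sketch below assumes logs arrive at a uniform rate through the year, which the report does not state; the function name is made up for the example.

```python
ENTRIES_PER_YEAR = 20_000_000_000  # estimated ingest rate quoted above


def records_retained(months, entries_per_year=ENTRIES_PER_YEAR):
    """Approximate number of records held for a given retention period.

    Assumes a uniform ingest rate across the year.
    """
    return entries_per_year * months / 12
```

Three months of retention gives 5 billion records and four months gives roughly 6.7 billion, which brackets the "somewhere around 5-6 billion" estimate.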

General Activity:
On-boarding the newest member of the team is ongoing, and has proven to be a
very smooth task, with every member of staff pitching in to provide guidance
and help.

The mail archive PoC has been on a hiatus, with the two lead designers being
away on travel at various times, but as the systems have been running by
themselves in the background, they have provided valuable debugging information
for optimizing and further developing them. We remain convinced that we are on
the right track in terms of software to use for the next generation of mailing
lists, as we have not seen sufficient evidence of other alternatives operating
at the same scale the ASF does.

Uptime Statistics:
See for details.

The critical services SLA was not met this month due to an LDAP outage that
caused our "LDAP Sync" checks to fail for an extended period of time. While the
LDAP service was not itself unavailable, we will be looking into better ways to
ensure that we act faster when nodes go out of sync or stop functioning. The
overall SLA was however met. In practice this had almost no effect on users, but
didn't meet our own internal standard.

21 Oct 2015 [David Nalley]

Reviewed by:

New Karma:
Infrastructure has added Daniel Takamori as a part-time staff member.
You can read more here:
This is a backfill of the part-time contract position that expired in
May of this year.

I spent some time talking with the Virtual folks about our rate of
expenditure in some specific GL Accounts - those used for replacing
hardware, cloud infrastructure, and build farm costs. While we are
currently close to plan in terms of totals, the rate of spending has
increased dramatically, and failing either a reduction in spending,
or in-kind sponsorships (one of which we are working on), we will
be over budget in those specific categories. That said, we are
significantly under budget overall.

Operations Action Items:

Short Term Priorities:

Long Range Priorities:
We've made small, regular strides in increasing our automation efforts,
 but nothing particularly report-worthy.

Much work around backups, both improving the scope as well as the quality
of our backups is ongoing. We are backing things up, but need to figure out
a retention policy, and that work is yet to be done.

Technical Debt
We've made significant strides in migrating workloads off of aging machines.
While we have a long way to go, progress is occurring. Most notably by end
of month we should have everything in place to retire the last remaining
machine at our Florida Colocation facility and cancel that contract.

After last month's significant gains, we are seeing benefits brought on
by monitoring, but not necessarily additional gains in the monitoring.

General Activity:
ApacheCon Europe happened in the past month, and a number of Infra
volunteers and staff members attended.

We suffered a second LDAP outage, that required editing corrupted
data and restoring it. This caused a brief outage.

Mail system proof of concept work continues, though the largest
blocker at the moment is ensuring that a broader-than-the-ASF
community cares about Ponymail.

Uptime Statistics:
See for details.

Despite some needed LDAP maintenance that caused a brief outage, overall
we met or exceeded all of the SLAs this month.

16 Sep 2015 [David Nalley]

Operations Action Items:

Short Term Priorities:
We experienced what we believe to be a DDoS attacking the mirror
redirection CGI script this month. This drove our 15 minute load
average to 2700+. We ended up using a much more efficient redirection
script, and redirecting all queries to the old one to the new to
resolve the issue, and it took us around 12 hours to mitigate the

During the reporting period we lost an LDAP server inadvertently, and
this caused a number of services to cease being useful. However, our
alerting detected the problem in a timely manner, and thanks to our
resilient architecture and configuration management, we were able to
provision a new host and have it working again within 12 minutes. A
12 minute Mean-Time-to-Recovery is a stunning statistic.

Long Range Priorities:
See the monitoring section for details on how we've automated the
blocking of abusive traffic.

We didn't add much in the way of resilience, but we have a great
example of how our resilience allowed us to quickly recover from
a failure. See the short term section above.

Technical Debt
See details of mail in the General Activity section

This month a lot of investment in monitoring over the past quarter has come
to fruition.

First - we have finally promoted centralized logging into production. This has
given us tremendous additional insight thanks to the visualization and query
tools that are now available.

We did run into some scaling issues with the 'preferred logging tool' and were
able to move to a very simple python-based log-ingester, that works on both our
new puppet-managed machines as well as our legacy machines.

Once we had data in place, and the ability to run analysis on the fly, we
immediately saw a number of situations where our services were being abused.
Eventually, we determined that we could programmatically deal with a number of
these issues, across all of our machines. To that end, we've now deployed a
tool called blocky that, based on input from our logging system, automatically
blocks IP addresses across our entire infrastructure. We have a catalog of how
blocking this abusive behavior has dramatically reduced bandwidth usage, in
one case 30% of a server's total bandwidth was caused by abuse.
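
The kind of analysis described here, finding abusive clients in log data and measuring how much bandwidth they account for, can be sketched as follows. This is not blocky's actual logic, which is not described in this report; the request-count threshold, the function names, and the dict-based log entries are assumptions for illustration.

```python
from collections import Counter


def find_abusers(log_entries, max_requests):
    """Return the set of client IPs with more than `max_requests` entries."""
    counts = Counter(entry["ip"] for entry in log_entries)
    return {ip for ip, n in counts.items() if n > max_requests}


def bandwidth_share(log_entries, ips):
    """Percentage of total bytes served that went to the given IPs."""
    total = sum(entry["bytes"] for entry in log_entries)
    if total == 0:
        return 0.0
    abused = sum(entry["bytes"] for entry in log_entries if entry["ip"] in ips)
    return abused * 100 / total
```

The set returned by find_abusers would then be handed to whatever performs the blocking on each machine, and bandwidth_share gives a before-the-fact estimate of how much traffic the block is expected to remove.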

In addition to the technical benefits, we can also provide insight to projects
and fundraising for how much traffic is visiting our web properties, or even a
specific project's site, and where that traffic is coming from and what they
are doing most often.

General Activity:

Mail Phase 2

Mail has been interesting: we went from a very promising POC to realizing at
least one component would likely not be able to scale to match our historical
load, much less be able to scale to the future. In general, while we don't
currently see any blockers, we are hearing of troubling experiences from

In response to that, we've developed a prototype of a replacement called
Ponymail.  It can certainly handle the load. Our plan is to move this software
to the ASF, and ensure that it can develop a community around it. We've
already called attention to it with some similar organizations who are going
through mailman3 POCs.  We will not adopt this software unless a community of
folks other than ASF Infra cares about and helps to develop it. We
want to make sure we aren't replacing an aging system with additional
technical debt that will come back to haunt us in 5-7 years.

Uptime Statistics:

We met this month's service level expectations.

19 Aug 2015 [David Nalley]

New Karma:

LastPass: $168
Dinner meeting with Lance from OSUOSL $65.50
Cloud Services: $4619

(The cloud services cost was somewhat inflated by the fact that we purchased
reserved instances; roughly $1k was due to this. However, it reduces our long
term payout somewhat dramatically.)

Operations Action Items:

Short Term Priorities:
- Sort out mail archives not updating for certain lists
- Figure out and solve emails supposedly being denied from certain senders

Long Range Priorities:
- Mailman/HyperKitty proof-of-concept, implement ASF account design into it
- Revisit unified logging with a new machine setup
- Fully deprecate the remaining few large non-config-mgmt boxes

General Activity:

Mail archives
An issue has arisen where certain mailing list archives have not been
updating with emails from August. The issue has been narrowed down to the
mod_mbox list database not updating, despite the raw data being present on
the archive servers. While the investigation is not complete, we have
uncovered some permission and shell environment issues that are related to
it, and expect to be able to solve this issue before the next report.

infra@ ML split
It has been suggested, and agreed upon, that the infrastructure mailing list
be split into a development and a commit part, so as to not burden people,
who are only interested in the former part, with the latter. This split is
expected to be implemented in the coming weeks.

The root@ address is not a list (but an alias) and will remain so.

We intend to keep the old addresses working, but forward them to the new
list addresses. Also, as infra@ is currently privately archived we will not
make those archives public. Going forward however, they will of course be
made public.

Mailman3/Hyperkitty update
Work has been progressing, and the most recent activity has been a discussion
on how we will implement authentication and maintain the concept of
/private-arch/ (where foundation members are entitled to interrogate any
private mailing list). This feature is fairly unique to the ASF and as such
no mailing list provider has this capability natively.

The PoC has been stood up, and is accepting mails for a couple of test domains.
As soon as the platform is ready for people to look at and test, further
information will be shared.

We're now using Datadog as a SaaS monitoring service. It's done a decent job
of getting us a baseline of metrics. This has given us an increased
level of visibility, at least into the systems that are managed by
configuration management.

Uptime Statistics:
Going forward, uptime statistics, as they relate to our SLAs, have now been
fully automated and can be found at:

While primarily done to save us from using 3-4 days of contractor time on
these statistics every year, we also felt that there were no compelling
reasons to not have this publicly available ahead of report time, as both
our SLAs and the uptime data itself have (technically) been publicly available
for a long time in its raw form.

To sum up quickly, all service level goals have been reached for the past
5 reporting cycles:

Cycle:     | Critical (>=99.5%) | Core (>=99%) | Standard (>=95%) | Average
Mar - Apr  |   99.52%           |  99.81%      |   99.14%         |  99.58%
Apr - May  |   99.83%           |  99.96%      |   99.22%         |  99.75%
May - Jun  |   99.72%           |  99.67%      |   99.32%         |  99.59%
Jun - Jul  |   99.50%           |  99.13%      |   99.65%         |  99.34%
Jul - Aug  |   99.50%           |  99.87%      |   99.80%         |  99.77%
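
The claim that every target was met can be checked mechanically against the table. The sketch below transcribes the three tier columns from the table above and compares each against its target; the dictionary layout and function name are just one convenient way to write it, and the "Average" column is left out because the report does not say how it is derived.

```python
# Targets per tier, as given in the table header.
TARGETS = {"critical": 99.5, "core": 99.0, "standard": 95.0}

# Uptime per reporting cycle, transcribed from the table.
CYCLES = {
    "Mar - Apr": {"critical": 99.52, "core": 99.81, "standard": 99.14},
    "Apr - May": {"critical": 99.83, "core": 99.96, "standard": 99.22},
    "May - Jun": {"critical": 99.72, "core": 99.67, "standard": 99.32},
    "Jun - Jul": {"critical": 99.50, "core": 99.13, "standard": 99.65},
    "Jul - Aug": {"critical": 99.50, "core": 99.87, "standard": 99.80},
}


def targets_met(cycle):
    """True if every tier in the cycle is at or above its target."""
    return all(cycle[tier] >= target for tier, target in TARGETS.items())


all_met = all(targets_met(cycle) for cycle in CYCLES.values())
```

Note that the critical tier sits exactly on its 99.5% target in the last two cycles, so the ">=" comparison matters for the result.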

Contractor Details:

15 Jul 2015 [David Nalley]

New Karma:

$2930 - Rackspace cloud
$60 - Dotster
$66 - Hetzner
$3336 - AWS

Operations Action Items:

Long Range Priorities:


While the general spread of configuration management continues,
the major increase has been with adding Confluence as a puppet
managed service. The Buildbot master and Jenkins master have
started the move to being completely managed by configuration management.


This month we've reduced the disparate number of SSL proxies. We
now have three, identical, SSL end points, all capable of serving
the same content. With some upcoming work around GSLB, this should prepare us
to have an additional level of resilience for the many
services behind the proxies.


We didn't make much progress on this front this month.

Technical Debt

Our GeoDNS instance failed this month. This service routes
users to the closest web server, SVN deployment, or rsync host based on where
they are in the world.  The underlying host went offline and we were
unable to resurrect it. We are awaiting assistance from smart hands
in Europe to do so. In the meantime we've disabled the service, which is
shunting all users to a single instance of these services. We'll be
moving this service to an external provider with substantial
redundancy in the future.

General Activity:


Infrastructure has run two bugbashes in the past month. Over the
course of both days, 39 bugs were resolved.

Buildbot Security Issue

As noted in the blog, we were alerted to some aberrant
network traffic originating from our buildbot master.
While we were able to fix the underlying issue we
also realized that an abundance of caution dictated that we should rebuild the
machine. We also were approaching the EOL of the hardware. This led to the
decision to take the pain and rebuild it. As of this writing the post-mortem
on the incident hasn't occurred, but findings will be reported when it happens.


Our Jenkins master has been suffering from disk I/O issues and we
opted to change the underlying file system. While we
were down for that operation we took the opportunity to begin the process of
puppetizing the host. The restoration of data
to the host took much longer than planned, but the final
outcome appears to be performing much better.

Uptime Statistics:

Contractor Details:

Gavin McDonald

- Oncall Duties: remove some snapshots on hades to free up some disk space.
We are only keeping around 1 month's worth of snapshots on all repositories now
due to the space issue.
- Upgrade Confluence Wiki to latest version - documentation created in cwiki
- Migrate Buildbot to a new buildbot-vm - documentation created in cwiki
- Begin work on and local testing of buildbot master module
- 66 Jira tickets worked on, closing 53 - some longstanding CMS related tickets

17 Jun 2015 [David Nalley]

New Karma:

$3150 - cloud related expenses
$17.49 - domain renewals

Operations Action Items:

Short Term Priorities:

The issues reported last month seem to have been dealt with
earlier in the month.

Disk usage on our SVN host became an issue. We attempted a
service migration to a cloud-based host with more storage.
We migrated the service, then discovered problems that
weren't found in testing and rolled back.

Long Range Priorities:

Moving to RTC for our configuration management code has been
an interesting exercise. It certainly hasn't found all of our
issues, but it is forcing cognizance of what and how folks are getting
things done. In addition we are seeing a number of issues fixed
during review, before they get pressed into service.

We made little progress on improving our resiliency this month.

Technical Debt
We made little progress on reducing our technical debt.

This month has seen a sharp rise in false positives in monitoring.
This naturally adds to the workload of the oncall person, and is
frustrating. Work is ongoing to reduce the number of false positives
that wake folks up.

General Activity:
As of June 1, Chris Lambertus and Geoff Coreys are now
employees within the Virtual PEO.

Uptime Statistics:

20 May 2015 [David Nalley]

New Karma:

$3013 - related to ApacheCon
$422 - hardware replacement
$2350 - Cloud expenses
$60 - Registrar fees

Operations Action Items:
This month has seen work around onboarding the two US-based
contractors as employees. This involves background checks,
reference checks, as well as paperwork. That looks to be
largely complete. It is likely that the two US-based contractors
will be employees effective June 1.

Short Term Priorities:

Git-based websites are now possible, and seem relatively popular.
8 projects are now using the gitwcsub services, and we are
receiving ~2 requests per week.

SHA-1 based chains have been deprecated. Chrome and Windows now
show alerts for SHA-1 based certs, or certs with SHA-1 certs in the
chain.  Due to this we spent a good amount of time swapping
out SSL certs this month. Despite switching out, this hasn't been
completely trouble free. Some git binaries for Mac or Windows seem
to be having difficulty, and this is a problem that continues to be

Long Range Priorities:

This month has seen some automated testing enabled for our
configuration management repo. In addition, we've moved to RTC
from CTR for most changes - while this isn't final, the change
at this point appears positive.

The big boost this month in resilience comes in the mail
infrastructure. See the comments in the mail section.

Technical Debt

While most of our monitoring has been working we continue to
have issues on our legacy systems that leave us without an
ability to monitor. Additionally, some of our applications
need deeper introspection for specific functionality,
and we have yet to cope with that.

General Activity:

This month has seen the culmination of phase 1 of our mail
overhaul. Many thanks to the SpamAssassin community and
Kevin McGrail in particular for providing insight into
what they see as best practice. Phase one is focused on our
MXes, spam and virus processing. In some ways, we've overbuilt
the current deployment - our new architecture can scale horizontally
allowing us to handle many times our current mail load.

Much work continues to be done on this front, and should be seen
as phase 2 materializes.

Uptime Statistics:

Overall, the total uptime for 2015 increased by 0.02% this month, in what has
generally been a quiet month in terms of emergency maintenance and downtime.
The few services that performed badly this month have been mentioned in
earlier reports, and steps are being taken to increase the availability of
these services in the long run.

Type:                Target:   Reality, total:  Reality, month:  Target Met:
Critical services:   99.50%             99.62%           99.83%      Yes/No
Core services:       99.00%             99.88%           99.95%      Yes
Standard services:   95.00%             98.96%           98.91%      Yes
Overall:             98.59%             99.57%           99.64%      Yes

Contractor Details:

Gavin McDonald

- Oncall Duties: Whimsy died 04/24 and needed a reboot. Crius disk space reached
- Some buildbot slaves offline, look into issues and bring back online.
- Look into various projects' Buildbot failures and resolve, liaising with the
 projects as necessary.
- Reviewing and merging others' branches to deployment.
- Crius and Hemera were 100+ packages behind, 3/4 of which were security
- Updated both then enabled auto-patching of security.
- More work on blogs_vm module, with others checking and merging
- Setting up a more robust local puppet module testing env
- Investigate and fix PagerDuty alerts for various services
- Investigate Moin Wiki ongoing issues and start to compile a report.
- Reduce Crius disk space to 70%
- Clear space on analysis-vm (sonar)
- 57 Jira tickets worked on closing 40
- Various Cron mails looked into and resolved (mainly new cert errors)

22 Apr 2015 [David Nalley]

New Karma:

$53 domain renewals
$2165 Cloud Services

Operations Action Items:

Short Term Priorities:

We've increased our spread of LDAP so that all of our cloud regions now
possess an LDAP server for authentication to work over the local network.

We've generated a FAQ to catalog issues and questions that came up during
the RFC. We hope to have some direction during or shortly after ApacheCon.

Long Range Priorities:

We've been working on a number of automation efforts. One of our
recent deployments of LDAP has proven we can deploy a new host in
less than 10 minutes from provisioning to functional service.
Additionally, we've been working on moving more services. One of
the highlights this month is the STeVe deployment. We now
have the service in a state where it is trivial for us to deploy
a fresh machine with a STeVe deployment for projects to use.

We've begun moving VMs off of our internal VMware deployment. At the
same time we have been working on spinning up our VMware-based cloud
deployment at PhoenixNAP.

Technical Debt
We suspect (but are unable to prove) that some of our VMware issues
are related to using EOL/EOS software from VMware that requires us
to run the deployment for at least one machine in a suboptimal

While monitoring has been useful in identifying issues, we haven't
materially expanded its use this month, something we hope to remedy
this month.

General Activity:
We had a total of 98 incidents in the month of March that we alerted on
and paged the on call contractor out for. The relatively high number has
caused some concern, and we are tracking ways to reduce the number
of alerted incidents to the truly severe.

A number of Jira imports have successfully been handled. These include the
plethora of Maven project imports, Tinkerpop, and Groovy among others. The
bulk of this work has been done by Mark Thomas who deserves special thanks
for tackling this work.

VMware hosts
One of our VMware hosts has been suffering from intermittent network
failures as well as the entire machine dying. This originally appeared
to be time related. We've begun migrating services off of this host in
order of priority, but the outages have affected a number of important
services, including the git services as well as blogs.

Uptime Statistics:

Contractor Details:

Gavin McDonald

Engage with OpenOffice community regarding a few Infrastructure related issues, such as the Mac Buildbots

Revisit and engage with CouchDB PMC to sort out 2 outstanding Domain Name
transfers.  Sort out couch.[org|com] domains with Dotster, Domains are
successfully transferred.  Next step is to sort out secondary DNS and httpd

Covered on-call whilst folks were travelling to ACNA

Jira Tickets: 68 worked on and 56 completed as of 04/20/15

Confluence Wiki migration was completed.

Began work on moving the Blogs service to Puppet 3 and a new home in the cloud.
Began work on preparing to migrate TLP playgrounds to Puppet 3 and the cloud.

Prepared a draft policy covering VCS canonical locations.

On 4/17 qmail stopped sending mails and the queue went above 100000. Resolved
issues and service running normally again.

commonsrdf cms setup refuses to create a staging site. Looked extensively into
the problem, but it remains unresolved at this time. This is a blocker for the
project as they have no website and are waiting on it to do a release. See:

18 Mar 2015 [David Nalley]

New Karma:

$135.40 -
$285.00 - Silicon Mechanics
$1267.11 - Amazon Web Services
$17.49 - Dotster

Operations Action Items:

Short Term Priorities:
There have been several signing events in the past month.
4 total signing events for Tomcat 8.0.19 and 8.0.20.
1 total signing event for OpenMeetings 3.0.4

Machine deprecation
We continue to make progress in moving services off of some
of our oldest hosts.

VMware issues
We experienced repeated network failures on one of our newer
VMware hosts. This took down most or all services on the host
repeatedly. This appears to be related to time issues, and seems
to have settled down, though we continue to watch it closely.

The new backup service is moving along well. 7 of our hosts are
currently backed up by the new service. Our goal is to have all
of the hosts backed up by end of the year and deprecate our
Florida colo and the machine running there.

Long Range Priorities:
We've made good progress on a number of automation fronts.
The work to automate Confluence has been done, and the service will
move hosts in the coming weeks.
A number of our SSL Endpoints have now been puppetized, as have services
like asfbot, svngit2jira, and gitpubsub.

We haven't made a lot of progress on this front in the past month aside from
the ongoing automation efforts.

Technical Debt
We've run into a bit of technical debt surrounding CGI pages. This had a
deleterious effect on a number of projects. Several pieces of debt from
this were discovered. The first was that a contractor had been manually
setting the executable bit across hosts. The second was that the CMS was
not picking up and communicating permission changes even when set
appropriately. There had been a custom CGI module written in the past, and
we initially spent a large amount of time trying to get that to work in our
new environment, to much frustration. We decided that bearing
responsibility for maintaining a custom module would only amass more
technical debt. In the end we set the executable bit for a number of
projects by using svnadmin karma. We believe these issues to now be resolved.

In the past month we've uncovered a few holes in our monitoring that have
resulted in users pointing out problems to us. We are still working to
fill those holes.

General Activity:

As mentioned in last month's report, we are struggling to find a solution
for the CMS which is on an aging physical host. We've sent a request for
comments on some of our proposed solutions to PMCs and operations@.

Uptime Statistics:

Overall, the total uptime for 2015 fell by 0.05% this month, mostly
contributed to by some issues with one of the VMWare hosts, as well as growing slightly more unstable (though nothing alarming).
We currently have two services with uptime below the SLA (
with 98.83% and the moin moin wiki, with 93.31%), while the remaining 26 in
the uptime sample are above the SLA. The bad performance of the wiki is due
to the fact that the moin moin wiki was not designed to scale well, thus
resulting in a lot of >30s response times - not errors per se, but still
counted towards its uptime.
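
To illustrate how slow responses pull the figure down, here is a minimal
sketch. It assumes each check records whether the request succeeded and how
long it took, and that anything slower than 30 seconds counts as down. The
probe format and function name are hypothetical, not the actual monitoring
code.

```python
# Hypothetical sketch: uptime where slow responses count as downtime.
SLOW_THRESHOLD_S = 30.0  # from the report: responses over 30s count against uptime


def uptime_percent(probes):
    """probes: list of (ok, response_seconds) tuples, one per check."""
    if not probes:
        raise ValueError("no probes recorded")
    # A check counts as "up" only if it succeeded AND answered in time.
    up = sum(1 for ok, secs in probes if ok and secs <= SLOW_THRESHOLD_S)
    return 100.0 * up / len(probes)


# Ten checks: none failed outright, but one took 45 seconds.
sample = [(True, 1.2)] * 9 + [(True, 45.0)]
print(uptime_percent(sample))
```

In this made-up sample no request returned an error, yet one check in ten is
still recorded as downtime.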

Type:                Target:   Reality, total:  Reality, month:  Target Met:
Critical services:   99.50%             99.65%           99.44%      Yes/No
Core services:       99.00%             99.83%           99.84%      Yes
Standard services:   95.00%             98.95%           98.82%      Yes
Overall:             98.59%             99.55%           99.47%      Yes
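
The "Target Met" column can be checked with a short sketch like the one
below. The figures are copied from the table above; reading "Yes/No" as
"met for the running total, missed for the month" is an assumption, as are
the data layout and function name.

```python
# Hypothetical sketch: compare each tier's uptime against its target.
# Tuples are (target, total so far this year, this month), in percent.
TIERS = {
    "Critical services": (99.50, 99.65, 99.44),
    "Core services": (99.00, 99.83, 99.84),
    "Standard services": (95.00, 98.95, 98.82),
    "Overall": (98.59, 99.55, 99.47),
}


def target_met(target, total, month):
    """Return 'Yes'/'No' for the running total and for the month."""
    def fmt(met):
        return "Yes" if met else "No"
    return fmt(total >= target), fmt(month >= target)


for name, (target, total, month) in TIERS.items():
    total_met, month_met = target_met(target, total, month)
    print(f"{name}: total {total_met}, month {month_met}")
```

Under that reading, critical services met the target for the year to date
but missed it for the month, while every other tier met it on both counts.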

Contractor Details:

Daniel Gruno:

 Non-JIRA activities:
   - Moved (and puppetized) services from urd (baldr) to a new host:
       - gitpubsub
       - svngit2jira
       - asfbot
   - Weaved into Marvin
   - Helped debug and fix issues with Marvin not working
   - Fixed some issues with the URL shortener (bad regex)
   - Updated documentation for Git services (git-wip + mirrors)
   - Fixed API (3rd party change) for our status page
   - Fixed an issue with the project services monitoring not firing off
   - Set up a replacement for aurora for www-sites.
   - Experimented with the CouchDB project on a new gitwcsub service for web
     sites, allowing projects to use git for their web sites instead of svn
   - On-call duties

Geoffrey Corey:
 - Resolved 31 JIRA tickets
 - Fix mail relaying from hosts not in OSU network (bugzilla host)
 - Work with AOO to get back missing buttons/options in their bugzilla
 - Work on back log of git related JIRA tickets
 - Work with GitHub support to turn back on the mirroring service
 - Create mysql instances in PhoenixNAP for new VMs in there
 - Migrate pkgrepo to bintray (still need to figure out GPG signing in a
   proper way)
 - Work with Gavin on the Puppet 3 confluence wiki module and its
 - Create puppet manifest for erebus-ssl terminator/proxy rebuild and finish
   up testing for it
 - On-call related duties

Chris Lambertus:
 - time off due to personal/child care
 - On call duties
 - General assistance to other contractors
 - Troubleshooting and diagnostics for ongoing lucene git issues
 - troubleshooting and diagnostics for ongoing eirene VM host issues
 - completed base functionality of zmanda project
 - resolved issues with zmanda+vmware connectivity
 - resolved issues with abi reaching storage limits (add monitoring)
 - extensive tuning of zmanda system

Gavin McDonald
 - Worked on 30 tickets closing 23
 - More Confluence wiki puppetisation
 - Migration work of Confluence to a new home (99% complete.)
 - Various other general support, looking at VM issues

18 Feb 2015 [David Nalley]

New Karma:

17.49  - Domain name renewals
979.12 - Amazon Web Services

As a side note, we expect to spend dramatically less on hardware than
we originally budgeted. This difference is coming about due to a number
of issues: Our build farm has been dramatically subsidized thanks to
Yahoo who have provided ~30 physical machines (along with hosting the
hardware and providing smarthands support). We also now have multiple
cloud providers giving us extensive credits, and we are moving a number
of services to public cloud providers. This does mean a slight change
from capital expenses to operational expenses, though I don't think
it matters much from our perspective.

Operations Action Items:

Short Term Priorities:

Tomcat generated two signing events this month, both for Tomcat 8.0.18

Machine deprecation
We've made significant progress in moving services off of some of our oldest hosts.
In doing so we've also spent a good chunk of time automating these
services and making our deployment more robust. See more on this issue
in the section on Automation in Long Range Priorities.

We are just now beginning the deployment of a new centralized backup service.
The client installation as well as the server has been automated; expect
to see more in this space in the coming months.

The LDAP service has largely been rebuilt from the ground up. For
background, the old machine that formerly served as our svn master was
also one of our LDAP machines, and when it failed, we were down to one
very poorly performing instance in the US. Subsequently, we've rebuilt
the entire service using configuration management, we again have two
LDAP hosts in OSUOSL, that are easily handling the load. This has sped
up many of the authn/authz actions that were slow last month. We've also
deployed new LDAP instances to several of our cloud zones. While the
service seems to be working well at this point, we did have a few
hiccups, where the old LDAP servers were removing newly created
accounts. The older LDAP servers were left in place after we repeatedly
found a number of services that had either minotaur or harmonia
hardcoded as LDAP servers. Because of these problems we removed the
legacy instances, and are dealing with the problems caused as we find

Our three Bugzilla instances were still running on a 6 year old machine
but have since been successfully puppetized and migrated to VMs in one of
our cloud accounts. In the process we've worked fastidiously on improving
the software deployment mechanism (software is now deployed as an
OS-native package).

Long Range Priorities:

Significant progress to report this month. As indicated above, LDAP
machines are all under configuration management. Additionally, we migrated
all three of the Bugzilla instances and the git repositories to being
completely managed by configuration management. After finding some
problems with some of our CM-managed instances, we've adopted a process of
destroying and recreating a service as a verification step prior to
pressing a service into production status.
Naturally the new services we are bringing online, like the backup service
are all managed under CM.

Technical Debt
The move of LDAP has uncovered a lot of hardcoded values and we've been
working to pay that off (referring to service names rather than
specific machine identities, putting configuration in configuration
management where possible).
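
As a rough illustration of referring to service names rather than machine
identities, consider the sketch below. The mapping, the placeholder
hostnames, and the function name are all hypothetical and do not reflect
the actual Infra tooling.

```python
# Hypothetical sketch: clients ask for a service by name; only this one
# mapping (which would live in configuration management) knows the hosts.
SERVICE_HOSTS = {
    "ldap": ["ldap1.example.org", "ldap2.example.org"],  # placeholder hosts
}


def hosts_for(service, registry=SERVICE_HOSTS):
    """Return the hosts currently backing a named service."""
    try:
        return list(registry[service])
    except KeyError:
        raise LookupError(f"unknown service: {service}") from None


print(hosts_for("ldap"))
```

With this shape, replacing a machine means editing one mapping instead of
hunting for every tool that hardcoded the old hostname.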

General Activity:

As of January 30th, we enabled Cassandra's debian repository on bintray
and have been monitoring it closely. The service seems to be working
well and appears to have fulfilled the needs of reducing our overall
webserver traffic and has the bonus of giving more insight into the
downloads. You can see some of the statistics on the dashboard here:


A lot has been happening around Maven this period.
We've successfully been able to sync a copy of the Maven central
repository, and are working to provide access to that store for the Maven
Additionally, Mark Thomas along with folks from the Maven PMC have been
working on migrating the Maven contents of the Codehaus Jira instance to
the ASF instance. Good progress has been made here, but much remains to be

Uptime Statistics:

We experienced an issue where was reporting erroneous uptime
statistics due to two unused LDAP checks, which unfortunately made it into the
weekly ASF blog posts. Other than that, services have been running fairly
smoothly aside from the Moin Moin Wiki which has experienced high load times
for a couple of weeks. We are discussing what to do to remedy this. Overall,
the total uptime for this year grew by 0.03%:

Type:                Target:   Reality, total:  Reality, month:  Target Met:
Critical services:   99.50%             99.78%           99.82%      Yes
Core services:       99.00%             99.83%           99.92%      Yes
Standard services:   95.00%             99.02%           98.93%      Yes
Overall:             98.59%             99.60%           99.63%      Yes

Contractor Details:

Chris Lambertus

 - on call duties
 - closed 3 jira issues
 - extensive work on centralized backup deployment
   - ongoing testing and validation of zmanda evaluation
   - puppet work to build fully configuration management deployed host
 - resolved hardware issues with oceanus/FUB/Dell Germany
 - ubuntu libc vulnerability patching

Geoffrey Corey:
 - Resolved 25 JIRA tickets
 - Build out supporting environment in Puppet for bugzilla migration (sql
   database, webserver/proxy, bugzilla package building, etc)
 - Migrate and deploy bugzilla instances off baldr and into VMs
 - Investigate with others about missing LDAP accounts (and subsequently recreate)
   after LDAP server rebuilds
 - Fix authorization template regeneration (related to svn
   master rebuild)
 - Work on getting postfix alias management in Puppet
 - Begin learning buildbot related things from Gavin

Daniel Gruno:
 - Resolved 30 JIRA tickets
 - On-call duties
 - Worked on GitWcSub, the git version of SvnWcSub for potentially enabling git
   repos to act as web site sources
 - Puppetized and tested deployment of GitWcSub
 - Fixed a bunch of issues with the main MTA
 - Worked with others to resolve the aftermath of the LDAP network redesign
 - Fixed some issues with uptime reporting and alert statuses on
 - Fixed some issues with hardcoded values in the PMC management tools
 - Reached out to contacts about the DNS system overhaul

Tony Stevenson:
 - A lot of my time has been spent on 3 major tasks:
   - The rebuild of the LDAP service due to the poorly performing incumbent
     instances. This combined with the retirement of eris (old svn-master)
     following a terminal hardware fault, we were limited to 1 LDAP instance
     in the US and proved too much for minotaur to cope with. It appears that
     the version of slapd on FreeBSD on minotaur leaked memory at a phenomenal
     rate.  The new LDAP service has been moved to the latest slapd available
     in Ubuntu 14.04, it has been fully built using puppet, and configured so
     that new LDAP hosts can be added with ease.
   - Continue preparation and understanding for a rebuild of the email
     infrastructure. This has mostly taken a back seat but has been brought up
     to the top of my todo list now following the completion of the LDAP task
     above, and the CMS task below.
   - In line with our current policy of retiring hardware that is over 4 years
     old, and trying to make all services puppet managed; David asked me to
     review the CMS zone on baldr (which is now >7 years old) to ensure we can
     move the service and have it managed with puppet. However upon
     investigation it quickly became apparent moving the service was far from
     trivial given the requirement for ZFS alone.  Further reading of the code
     highlighted some areas of concern for me that I felt needed highlighting
     as they would likely carry over technical debt into the future and that
     is something we are working extremely hard to remove.  On the back of
     these findings and my inherited knowledge of the service I presented
     David with 3 options of how I thought we could manage the service going
     forward along with the estimated costs, and the pros and cons of each
     option. These were:
     1 - Move CMS to another FreeBSD host - undesirable given our current
         trajectory of moving away from FreeBSD, it meant entirely replicating
         the FreeBSD 9 jail too. This was the path of least change, but
         perhaps most difficult.
     2 - Move the service to Ubuntu, fully puppet controlled from the beginning.
         This might have been the ideal scenario, but the underlying hardcoded
         FreeBSD aspects, the need for ZFS (which is present in Ubuntu), and
         the very specific perl that is in place I felt this would take a long
         time to complete and would be significantly error prone. We would
         very likely miss something and this would need to be fixed on demand. My
         confidence level was low that we could execute a clean migration.
         This would be the most time consuming option, but the best option if
         retention of the CMS was important.
     3 - Deprecate the CMS, and allow projects to determine their own
         publishing/transformations options. We would still require use of
         pubsub technology, but projects can commit into that their HTML,
         either directly edited or derived from markdown which they can keep
         in their project repo (some of the finer details will need to be
         worked out later). This would essentially be the cheapest option,
         remove technical debt, and while the timeline would be many months
         contractor/volunteer time could be kept to a minimum.

21 Jan 2015 [David Nalley]

New Karma:

1429.14 - Amazon Web Services
3969.00 - Carbonite/Zmanda

As a side note, the new cards have arrived, which provide a much
better level of insight into spending; many thanks to the office
of Treasurer and EA for chasing this.

Operations Action Items:


Short Term Priorities:

Another project has requested code signing functionality. (UIMA
Four signing events occurred in the month. Two events each for Tomcat
8.0.16 and 8.0.17.

Machine deprecation
Work continues (and was slightly hindered by the holidays) on
deprecating the host that runs the writable git service and bugzilla.

Over the past several months we've found a number of services where
backups were either failing or not happening at all. We've spent a
good deal of time focusing on auditing backups and looking for a new
solution that gives us better visibility into the success or failure
of backup jobs. To that end we've selected Zmanda as our platform
of choice and have begun deploying it.


LDAP has emerged as a priority during this month. The loss of the
machine that served as the svn master last month reduced the number of
LDAP servers in our Oregon Colo to 1, and that instance is
consistently under heavy load, and logins to most services are taking
significantly longer as a result. We've had a lot of work in process,
and only recently began tackling the issue.

Long Range Priorities:


Work is continuing on monitoring, with a good leap forward this month.
We are taking advantage of (and contributing to) a project by the name
of dsnmp that queries the Dell OpenManage SNMP/WBEM frameworks as well
as the overall operating system health of the machine. This provides
status checking to alert us to issues that have frequently resulted in
outages or service degradation. This month we were alerted to multiple
issues that we were able to address before they resulted in outages.
This is not a panacea, nor are we done with monitoring efforts, but we
are in a much better position now.

Automation progress continued, though slowed somewhat by the holidays.
The writable git service is now handled by configuration management.

Currently the following services are in progress or in final stages of


We haven't made much progress on this front in the past month, aside
from the ongoing automation efforts.

Technical Debt

As part of our automation efforts, we've been able to generate
recreatable Debian packages for our customized version of Bugzilla.
The long term plan is for us to have a private build job that builds
new packages anytime the source code in our tree for Bugzilla is

General Activity:

We continue to explore the package repository service and have folks
from Cassandra currently working on moving their existing deb
repository from www.a.o/dist to this service as a pilot. Traffic about
the pilot has generated interest from other projects.

Addressing a long standing todo that came out of the Heartbleed
vulnerability, we now have an enterprise account with Symantec
that will allow us to provision certs on demand with no interaction
required for and

We've migrated a traffic intensive service from one provider to
another to minimize our cloud hosting expenses. (Currently 3/4 of our
cloud infrastructure expense is from egress traffic). The new provider
has a different fee structure that should result in noticeably lower
service charges.

Maven has requested, and we've agreed to host, an ASF copy of the
Maven Central repository from Sonatype. Work is starting around that,
but is still early. This is expected to cost in the range of $600 per

We began discussing deprecating as the service is
provided for only 4 projects, and has only one volunteer doing the work
of administering the service. Additionally, a number of l10n services
advertise free l10n hosting for OSS projects. Many of our projects are
already making use of those free offerings.

Uptime Statistics:
We have revamped our uptime charts a bit, added some services and removed some
deprecated ones. Our current overall target is 98.59% for these samples. This
month, the overall uptime was 99.61% with critical services achieving 99.81%.
All of the downtime here was due to moving the writeable git repos to a new

Type:                   Target:   Reality:   Target Met:
Critical services:      99.50%    99.81%        Yes
Core services:          99.00%    99.66%        Yes
Standard services:      95.00%    99.18%        Yes
Overall:                98.59%    99.57%        Yes

Contractor Details:

Geoffrey Corey:
 - Resolved 26 Jira Tickets
 - Clean up some TLP server related puppet modules to require no input
    and make sure deployment is a 1 step process (also allows svnwcsub
    use for services such as
 - Add logic to puppet that deploys Dell OMSA to physical Dell hosts for monitoring
 - Fix svnpubsub not updating entries (related to svn master rebuild)
 - Coordinate with OSUOSL to replace disk in Arcas
 - Research fpm to build an ASF bugzilla Debian package whenever the source tree changes
 - Create bugzilla puppet module to deploy ASF's different bugzilla instances
 - Complete TLP graduation for Falcon and Flink
 - Various on-call duties

Chris Lambertus:

 - Resolved 2 Jira Tickets
 - On call (xmas)
 - Resolved a number of hardware issues with erebus (bad dimm)
 - Installed and coordinated restarts for OMSA on Eirene
 - Cleanup of collectd configuration in Puppet to apply collectd to any system
 - Deployed new status.a.o at RAX
 - Installation, configuration and evaluation of Zmanda and other backup
 - Initial documentation of zmanda license count and tally of existing
 - Oceanus troubleshooting and coordination with Dell/Dell Germany to get
   warranty location updated and parts shipped to the right place (FUB)
 - Initial ESXi configuration to enable WBEM monitoring (eirene)

Gavin McDonald:

 - Worked on 53 tickets closing 34
 - Work on puppetising TLP VMs and Blogs

Daniel Gruno:

 - Assisted Tony in moving the writeable git repos
 - Orchestrated the move of from Infra to ComDev
 - Improved monitoring of ZFS pools
 - On-call duties
 - Ongoing discussions on svn redundancy setup, corporate offers and DNS setup

17 Dec 2014 [David Nalley]

New Karma:
Andrew Bayer was added to root@

$758.48 Windows License
$761.27 Replacement Harddrives
$1607.78 Service contracts
$1247.95 AWS

Operations Action Items:

Short Term Priorities:

A number of projects inquired about codesigning at ApacheCon with
intent to sign up and request access to codesigning.
In the last 30 days no releases have been signed.

Machine deprecation:
The failure of the machine underlying the SVN master has caused some
reprioritization of work and evaluating our older physical machines
for the important services that run there. There is ongoing work to
relocate the writable git service as well as the Bugzilla machines off
of the rapidly aging hardware.

Long Range Priorities:

Work continues on monitoring, with several advances being made.
The first is work with collectd that is being deployed to all of
our puppetized-machines. This gives us insight into performance metrics
of the machine. Additionally, we've managed to get Dell's OMSA platform
for monitoring the underlying hardware deployed to a number of physical
hosts. This information is being utilized to monitor for things like
failed disks, failed power supplies, and see other overall health
information like ambient and internal temperatures.

We've gotten a platform that integrates PagerDuty, Hipchat, and email
alerts for our Dell physical hardware that has OMSA installed.
You can see the code here:

We are also expanding the radius of hardware notifications from root@
to infrastructure-private. All of this while reducing that down to a
single email per day.

We continue to make large steps forward in our automation efforts.
The following services are now in configuration management. Several others
are currently in progress.
Git mirrors
Subversion master
status.a.o website

Additionally, some of last month's work around
getting the webserver into configuration management has been
adopted upstream.

We now have a second critical service that we have the ability to
easily replicate and redeploy.

Technical Debt
We are encountering lots of touchpoints that are tied to specific
machine names, specific file locations on specific machines and
a large chunk of our time is spent in having to decouple this.

General Activity:
This month has been extraordinarily taxing on Infrastructure, with
three large outages or service degradations occurring following

Several volunteers and contractors traveled to Budapest to attend
ApacheConEU. In addition to a dedicated track around Infrastructure,
an Infra hackfest table was manned where folks could come and ask
questions or get help. Many folks took advantage of this.

While at ApacheConEU, a number of volunteers and contractors met with
folks from OpenOffice to talk about existing and future needs.
Details of this meeting can be seen in Andrea Pescetti's notes:

Git mirroring (including git mirroring to github) suffered some disruption
immediately following ApacheConEU. The 6+ year old machine that the
service was running on was shared with 3 project zones, and as the number
of git mirrors has increased over time, service delays began increasing
to the point that the service was non-functional. Initial attempts to
restore service on the machine were not successful over the long term and we
ended up moving the service off of the affected host and into AWS. The
underlying machine that the service was running on has been deprecated. We
have sent the projects with zones there notice that we plan to shut down
the machine in the not too distant future. We have begun organizing
replacement resources in lieu of the Solaris zones.

The machine that ran the Foundation's SVN master suffered an outage caused
by a failure of the root filesystem array. The initial public reporting
is at:
The post-mortem report from this event is at:
We were able to resurrect the service on new hardware with the
configuration residing completely in configuration management. This should
minimize the time to recovery for future issues.
The recovery took approximately 2 days to complete.

We suffered a loss of all network connectivity to services in our colo facility
at Oregon State University Open Source Lab on 10 December. The outage lasted
almost 2 hours. We are still working with OSUOSL, OSU, and NERO to figure out
what happened. A redundant (but disabled) network link was activated to
bring us back online.

While monitoring the newly provisioned webserver, we discovered that
Cassandra is pointing users to a .deb package repository on the main
webservers instead of utilizing the mirrors, as package repositories
won't function with our current mirror offering. After some analysis
we found that this package repository was the source of 15% of all
traffic hitting our webservers.  Our initial thought was to block
that traffic. Doing so would have had a large impact on folks. We
are currently researching options to provide package repositories
so as to remove that load from our main webservers.
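
As a rough illustration of the kind of analysis described above, the share of
requests going to one area of a webserver can be estimated from the request
paths in an access log. This is only a sketch, not the tooling Infra used; the
function name, the sample paths and the "/dist/cassandra/debian/" prefix are
hypothetical.

```python
from collections import Counter


def share_by_prefix(paths, prefixes):
    """Return the fraction of requests whose path starts with each prefix."""
    counts = Counter()
    total = 0
    for path in paths:
        total += 1
        for prefix in prefixes:
            if path.startswith(prefix):
                counts[prefix] += 1
                break  # count each request against at most one prefix
    if total == 0:
        return {prefix: 0.0 for prefix in prefixes}
    return {prefix: counts[prefix] / total for prefix in prefixes}


if __name__ == "__main__":
    # Hypothetical sample: request paths already extracted from an access log.
    sample = ["/dist/cassandra/debian/dists/Release"] * 3 + ["/index.html"] * 17
    for prefix, share in share_by_prefix(sample, ["/dist/cassandra/debian/"]).items():
        print(f"{prefix}: {share:.0%} of requests")
```

Counting requests is only one possible measure; a byte-weighted share would need
the response sizes from the log as well.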

Uptime Statistics:
Unfortunately, uptime for critical services this month saw a sharp decline due
to the subversion outage. Also affecting uptime was the brief network outage on
December 10/11, as well as the migration of the git mirrors to a new location.
Overall, because all other services behaved in exemplary fashion, we did
experience a slight increase in overall uptime of 0.03% compared to previous
months. Thus,
the total recorded uptime for 2014 (weeks 27 through 50) is as follows:

Type:                   Target:   Reality:   Target Met:
Critical services:      99.50%    99.84%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    98.38%        Yes
Overall:                98.00%    99.39%        Yes
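
To put the percentages in perspective, they can be converted into hours of
downtime over the reporting window. The sketch below assumes each figure is a
simple time-based percentage over weeks 27 through 50 (24 weeks); how the
figures are averaged across individual services is not stated in this report,
so the results are indicative only.

```python
HOURS_PER_WEEK = 7 * 24


def downtime_hours(uptime_percent, weeks):
    """Hours of downtime implied by an uptime percentage over a number of weeks."""
    total_hours = weeks * HOURS_PER_WEEK
    return total_hours * (100.0 - uptime_percent) / 100.0


# Weeks 27 through 50 inclusive.
WEEKS = 50 - 27 + 1

# (target, reality) pairs copied from the table above.
REPORTED = {
    "Critical services": (99.50, 99.84),
    "Core services": (99.00, 99.77),
    "Standard services": (95.00, 98.38),
    "Overall": (98.00, 99.39),
}

if __name__ == "__main__":
    for name, (target, reality) in REPORTED.items():
        allowed = downtime_hours(target, WEEKS)
        observed = downtime_hours(reality, WEEKS)
        met = "Yes" if reality >= target else "No"
        print(f"{name}: allowed {allowed:.1f} h, observed {observed:.1f} h, met: {met}")
```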

For details on each service as well as average response times, see

Contractor Details:

Geoffrey Corey:
  - Resolved 37 JIRA tickets
  - Worked on completing the steps to graduate 4 projects to TLPs
  - First round of On-Call duties
  - Rename argus podling to ranger
  - Coordinate with OSUOSL to replace disk in Tethys
  - Coordinate with Henk P. to migrate us host off eos and
    onto the AWS TLP server
  - clean TLP off eos to help recover disk space
  - Clean up lingering details with TLP/dist and rsync being migrated off eos
  - Helped in various ways to restore svn master after the failure of eris

Daniel Gruno:
  - Helped to restore the subversion master after the failure had occurred
  - Worked on implementing more extensive SNMP monitoring of capable (mostly
    puppetized) hosts
  - Moved off the aging Solaris box and onto a new puppetized VM
  - Tweaked svn/git-to-github syncing process to cut down execution time from
    2 hours to 8 minutes
  - Helped clean up some TLP snafus in relation to bringing svn templates to git.
  - Various on-call duties
  - Resolved 35 JIRA tickets since last report
  - Fixed various rendering issues with by using asynchronous

Chris Lambertus:
  - Resolved 9 Jira tickets
  - Helped to restore the subversion master after the failure had occurred
  - On-call duties
  - fixed FUB VPN and rebuilt oceanus as cloudstack testbed
  - resolved ongoing issues with
  - resolved major outage due to eirene vmware network failure
  - resolved nightly backup problems with several hosts
  - ongoing prototyping and evaluation of "enterprise" backup solutions
  - troubleshooting assistance with TLP/git/svn puppet migrations and service
  - documented nexus project creation process
  - ongoing addition of hardware to Dell OME, acquisition of windows license
    to make this a complete service

Tony Stevenson:
  - Worked on the restoration of SVN master, including moving to Ubuntu and
  - Started work on moving the git-wip-us service off of again baldr onto a
    Ubuntu VM.
  - Various on call activities
  - Attended ApacheCon, where we had several infra sessions and an Infra
    meetup on the Sunday following the conf.
  - Worked on several JIRA issues and some longer term background activities
    like host patching.
  - Conducted the SVN master post-mortem exercise
  - Writing and updates of puppet modules to continue to make them platform

Gavin McDonald:

 - Worked on 75 Jira Tickets, closing 57.
 - Infra commits SVN - 33
 - On Call duties.
 - Work with new contractors on various issues.
 - Updating non puppet VMs/Machines for security updates.
 - More work on reducing cron mails
 - Work started on archive logging for HipChat
 - Restored SVN to Buildbot Hooks mechanism after the move.

19 Nov 2014 [David Nalley]

New Karma:

Joe Schaefer has resigned from Infrastructure Committee (and root@)


$250 for AWS

Operations Action Items:

Short Term Priorities:

* Wiki outages - As highlighted in the October Infra report we have been
  running into issues with the MoinMoin wiki. This degradation is caused
  by severe disk IO load on the machine that hosts this service. This is
  complicated by the fact that this same machine also hosts the US web
  mirror for the foundation and projects as well as the mail-archives
  service. Additional fallout has been that publishing website updates
  has tremendous delays for websites on the US mirror. We think that much
  of the sudden IO load increase is due to the machine's ZFS filesystem
  growing to ~90% full. Because of the copy-on-write nature of ZFS, and the
  allocation switches that happen when a volume begins approaching
  capacity, performance degrades severely. We evaluated all of the
  services on the host, and our initial analysis was that we'd be best
  served by separating the web sites for projects and the foundation.
  During the course of doing that, we've discovered that a large number
  of projects were distributing artifacts and publishing their website
  using a long-deprecated (February 2012 deprecation) method. This, and
  other factors, have complicated the process, but today I am happy to
  report that we now have an easily replicable webserver definition in
  configuration management that allows us to easily deploy any number of
  webserver hosts in short order, and paid off a large amount of
  technical debt in the process, as well as having the newer members of
  our team understand well the entire website process from checkin to
  publication. This has dramatically improved the wiki situation, though
  it has revealed some underlying issues with the wiki that will need to be
  dealt with.

Long Range Priorities:

While we continue to work on monitoring, and have far to go, this
month yielded two major improvements. The first is in disk capacity alerting, and
the other in failed array management. This is not yet pervasive in our
infrastructure, but is a start towards that end.
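
A minimal sketch of what a disk capacity check can look like is given below.
It is not the alerting Infra deployed; the function names and the 80% warning
threshold are assumptions chosen for illustration, picked to fire before the
~90% level at which the ZFS host mentioned earlier began to struggle. For ZFS,
pool-level figures would be a better input than a per-filesystem view.

```python
import shutil


def usage_fraction(used_bytes, total_bytes):
    """Return used/total as a fraction between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def capacity_alert(path, warn_at=0.80):
    """Return (alert, fraction) for the filesystem that contains path.

    warn_at is a hypothetical threshold, not a value taken from the report.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    fraction = usage_fraction(usage.used, usage.total)
    return fraction >= warn_at, fraction


if __name__ == "__main__":
    alert, fraction = capacity_alert("/")
    state = "ALERT" if alert else "ok"
    print(f"/ is {fraction:.0%} full: {state}")
```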

This month resulted in a large step forward, as a number of services are
now able to be deployed in an automated fashion and are in configuration
management. Many of these have been in process for a month or longer,
and include the following services:
* committers mail-relay
* all project and Foundation websites
* (mirror distribution)
* host provisioning dashboard
* inbound email MXes

The work done around automation has given us our first critical
services that we can easily replicate and deploy multiple times. To give you
an idea of scale, we can deploy an external email exchanger, completely
configured, in less than 10 minutes, or all of the project websites in about
3 hours (largely bound by having to download all of the site content).

Technical Debt

This month saw us paying back large portions of technical debt. Of particular
interest is the shuttering of legacy means of publishing releases and websites
that were deprecated almost 3 years ago. We also were able to decouple a large
number of very tightly bound services.

General Activity:

It is worth noting that part of the new monitoring systems we have in place
keeps an eye on the status of internal disks and disk arrays. On 30 October
we were notified in our HipChat room that the machine that
serves as our svn master had a bad disk in its array. One contractor
went to the datacenter to replace the disk with a spare from our inventory
whilst another contractor configured and onlined the disk, re-adding it to the
pool. The hardware issue went from notification to replacement and back online,
all in the same day.

Code Signing
Two more releases were signed this month, both from Tomcat.

Three additional projects (Logging (Chainsaw), OpenOffice, and OpenMeetings)
are now setup and enabled to sign artifacts, though most are still testing.

Uptime Statistics:
Overall, uptime has seen an increase of 0.30% compared to
last month, putting uptime for the October-November period at a record high
99.82% overall. The total recorded uptime stats since we started measuring it
are as follows (weeks 27 through 46):

Type:                   Target:   Reality:   Target Met:
Critical services:      99.50%    99.97%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    98.14%        Yes
Overall:                98.00%    99.36%        Yes

For details on each service as well as average response times, see

Contractor Details:

Daniel Gruno:
    Non-JIRA related issues worked on:
    - Explored and implemented a rewrite of our DNS system
    - Miscellaneous help/guidance for new staffers
    - Collated www+tlp server stats for an overview of our traffic/request
    - Worked on setting up Chaos as a disk array for Phanes (for unified
    - On-call duties
    - Assisted Geoff in moving rsync'ed data to svn for web sites
    - Helped tweak httpd instance on tlp-us-east to cope with the request load
    - Tweaked, added hard-coded notice about wiki.a.o
    - Fixed dependency issues with Whimsy
    - Deprecated SSLv3 on all SSL terminators in response to POODLE

Geoffrey Corey:
    Non-JIRA related issues worked on:
    - Finished migrating all tlp sites into puppet and new tlp host
    - Migrated lingering projects using rsync for artifacts distribution to
      using svnpubsub for distribution
    - Clean up retired sites with correct redirects for www.a.o/dist to their
      attic pages
    - Decommission/surplus the old hermes hardware
    - Replace disk in eris

    JIRA related tasks:
    - Resolved 11 JIRA tickets
    - Renamed incubator project optiq to calcite

Gavin McDonald:

 - Worked on 52 Jira Tickets, closing 27.
 - Infra commits SVN - 40
 - On Call duties.
 - Work with new contractors on various issues.
 - Resolve queries from IRC, HipChat and Email (no jira tickets)
 - Updating more Ubuntu machines/vms for Bash vuln.
 - Worked more on upgrading pkgng FreeBSD machines.
 - Continue Work on improving Pass rate of builds, liaising with projects as
 - Configure new disk into Eris Array.
 - Adding packages to all Jenkins slaves via ansible is now working fine. Work
   on doing the same for Buildbot slaves.

Chris Lambertus:

  - Closed 9 jira tickets
  - First on-call
  - Created new VM for status.a.o migration to Cloud (RAX)
  - Began work on evaluating backup and disaster recovery processes
  - Noted & resolved problems with zfs on abi causing failed backups
     - upgraded abi to FreeBSD 10.0
     - purged extraneous zfs snapshots
  - analysis and evaluation of tools for improved backups
  - implemented collectd puppet module (monitoring)
  - added circonus monitors for new tlp (monitoring)
  - begin work on oceanus cloudstack eval
  - troubleshooting & repair
  - MX incubator list troubleshooting with Tony
  - metis disk replacement PERC troubleshooting with Tony

Tony Stevenson
  - Working on several major priorities:
    - eos - The main US webserver has slowly over time grown its disk
      usage levels - most recently growing over the threshold at which ZFS
      suffers from significant performance penalties.  Disk capacity can't
      be increased and eos is scheduled for EOL, so the short term goal was to
      tidy up the data on disk. This was achieved by moving some of the older
      static data to eris. See work by others on the overall status of
      eos. Also see below commentary on a wiki migration PoC.
    - abi - The host in (FL) that we have been using for an offsite
      copy of data for a number of years had suddenly become increasingly
      unusable and jobs were failing. This was primarily caused by a failure
      removing old data snapshots. This essentially stemmed from the period
      when hermes had to be rebuilt. A lot of triaging of old copies of data
      had to be done; this was done in conjunction with others, notably cml@.
    - hermes - Fixed a long standing issue with a faulty disk on the dungeon
      master that hosts hermes. Also worked on better managing the incoming
      mail queue as on occasion it backlogs and has a compound effect on
      genuine mail delivery.
    - chaos (host where ELK is to be part-deployed) this work was delayed
      AC EU given the more urgent issues on eos and abi.
    - MX - After the unsuccessful attempt to migrate the MXes to new hosts run
      from AWS EC2, several lessons have been learnt and we have fixed all but
      one of these at the time of writing this report.  The last fix is more
      complicated and needs significant testing to sign off, and then we need
      to expose this change to the mailing lists affected so that they are
      in the loop, though we are aiming for a completely transparent cutover
      when we come to implement it.
  - As part of the bigger piece of work to unpick all the services on
    (and dependent upon) eos I have started work on setting up a PoC that will
    host the moinmoin wiki service ( in AWS EC2. This is
    good progress and a data synchronisation should begin during AC EU
    the team to see it working during the F2F on the Saturday after AC EU.
  - More puppet work, creating new modules and adding 3rd party modules to
    further the puppet managed aspects of our machines.
  - secretary@ workbench issues, this was seemingly related to a corrupt mbox
    file on minotaur. Moving this aside and having clr@ manually process the
    period allowed us to re-enable the automatic service, albeit at a much
    slower frequency than before.

15 Oct 2014 [David Nalley]

New Karma:

Domains: $85

Operations Action Items:

Short Term Priorities:

Code Signing
The code signing service is now live. Two projects have successfully shipped
signed artifacts. Two more projects are in various stages of signing up, being vetted
or testing out the service.

Long Range Priorities:

Work has been ongoing to provide monitoring of the underlying hardware many of our
services depend on; this is still in the early exploratory stages and remains ongoing.
Work building on our existing base of service status monitoring to provide insight
is also ongoing.

Much progress has been made around automation; multiple services are now fully defined
in configuration management, though we still have far to go.

Exploratory work to evaluate our ability to recover from disaster is underway, but is
still early.

Technical Debt

General Activity:

The primary US web server that provides foundation and project websites in addition
to mail-search and the moin-moin wiki has been problematic this week. Specifically
we're running into multiple problems occurring at once resulting in tremendous
IO overhead, and leading to very slow responses or outright failures.

This has been a very busy month from a security perspective. Our Bugzilla instances
went through multiple rounds of patches in response to 5 related security issues.
Additionally, we've spent much time responding to Shellshock.

Uptime Statistics:

Detailed Contractor Reporting

* Geoffrey Corey

  - Get acquainted with ASF's system and services layout
  - Resolved 7 JIRA tickets
  - Clean out ASF's server racks and old hardware/spare parts at OSU
  - Inventory spare hw at OSU
  - Learn how to do TLP requests
  - Learn how to do svn to git migrations
  - Setup AOO's mac mini build slave
  - Learn ASF's puppet infrastructure
  - Deploy NTP puppet module
  - Learn how svnpubsub and svnwcsub are setup and used, create a puppet module for it
  - Learn how to use AWS to begin migration of TLPs off eos to fix IO issues

* Gavin McDonald

 - Worked on 44 Jira Tickets, closing 30.
 - Infra commits SVN - 59
 - On Call duties.
 - Work with coreyg liaising with hardware to be removed.
 - Installed 2 new machines at OSU, DRACs configured, ready for OS installs
 - Resolve queries from IRC, HipChat and Email (no jira tickets)
 - Work through more Cron noise.
 - Worked with Intervision and organised Warranty renewals/declines.
 - Updating many Ubuntu machines/vms for Bash vuln.
 - Worked on and continuing to work on upgrading the FreeBSD machines. There is much
     work involved as we are breaking free of Tinderbox based updates and going
     direct to the official repositories. At the same time packages/ports are being
     updated (forcibly) to the new pkg system. The machines done so far are showing
     no ill signs as of yet but we haven't forced an upgrade of all packages, just a
     few essential ones. I expect as more machines are done, we'll start to see
     some things break. This is an unavoidable one way trip that we'll deal with.
 - Started looking into automating jira project key renaming, in support of dealing
   with when projects rename themselves. The Jira cli plugin looks promising, but
   doesn't seem to support renaming yet (but offers project cloning and deleting).
   Investigating the API directly but that too seems to lack support thus far.
 - Worked on improving Buildbot slaves stability.
 - Worked on improving Pass rate of builds, liaising with projects as necessary.
 - Restored RAT reports for projects and the RAT master summary pages.
 - In our cwiki, added the ability for projects to access Intermediate HTML for
   diagnosing formatting issues in PDF exports. (
 - Github to Buildbot to HipChat integration, testing github commits to infra repo

* Chris Lambertus
  * Closed N (I don't know what N is) tickets.

   * Troubleshooting and resolution for the vmware host outage.
   * ongoing, www.a.o/mail-search/moin-moin troubleshooting

   * Learning puppet
   * Implemented dovecot and SNMP modules

   * Researched ways to monitor existing hardware that runs FreeBSD.
   * Began needs analysis and some PoC work around monitoring with
     Circonus, collectd, SNMP, etc.

Disaster Recovery
   * Initial research into current state and how we might improve our
     disaster readiness.

17 Sep 2014 [David Nalley]

New Karma:
Chris Lambertus (cml)
Geoffrey Corey (coreyg)

RAM for VMware hosts  $715
Replacement HDDs  ~$1700
Mac OSX Build Slave $730
Puppet training $1300
Domains: $17

Operations Action Items:

Short Term Priorities:

## Code signing
Mark Thomas successfully concluded his testing and we were
able to come to agreement with Symantec. The service has thus far been
deployed with Mark Thomas leading efforts to deliver signed code for Apache
Commons. The Apache Commons PMC is currently voting on release artifacts for
the first signed binaries. Post-completion of this test the service will be
available to any PMC requesting the service.

## Build/CI environment.
### Yahoo has graciously increased the number of machines that they provide
(and provide colo services for) to a total of 20 machines this year. This has
tremendously reduced the pending queue size for our build services.

### Cloud slaves
Our RAX cloud environment is now being utilized by Jenkins to
deploy (and destroy) machines on demand in response to load. Additionally,
we’ve made a RAX account available to the Gora PMC for twice yearly testing
they plan to engage in.

Long Range Priorities:

* Monitoring: We are beginning to get insightful information out of monitoring.
We now have a mail loop that provides information on the cycle time from
sending to mail reception. Additionally we now have started monitoring some
elements of host storage. Centralized logging is making slow progress but has
a plan with a time table.

* Automation: The base level framework for machine automation is complete; and
that work is expanding. As we begin to need to break services out we are
building them with puppet. Additionally work to programmatically have JEOS
machines for bare metal as well as virtualization and cloud targets is
progressing nicely with most of that work expected to be wrapped up by end of

* Technical Debt/Resiliency: Some work has happened identifying
long-complaining error conditions in a number of processes and resolving them;
currently focused on errors around backup scripts.

General Activity:

* Welcomed two new contractors, Chris Lambertus and Geoff Corey, to the fold.
* We’ve dealt with an unusually high number of failed hardware issues this
* Sourceforge has reached out to infra regarding migrating Apache Extras to
* The machine that houses our US web server (for www.a.o and $tlp.a.o) as well
as mail-search and the moin-moin wiki has experienced tremendous IO load. Work
is ongoing to breakout those services and reduce total IO load for any given
machine. This has been noticeable to end users in the form of wiki slowness
and updates to project websites being slow on the US website.
* suffered a severe service degradation that resulted in
many projects being unable to publish artifacts to Nexus for several days. For
details see:
* We’ve found a number of processes that infra executes that appear to be tied
to being listed as a member in LDAP. We’re working to resolve that issue and
tracking it in INFRA-8336

Uptime Statistics:

Targets remain the same as in the last report (99.50% for critical, 99.00%
for core and 95.00% for standard respectively).
These figures span the previous reporting cycles as well as the
present reporting cycle (weeks 34-38). Overall, the figures have gone up
since the last report, and we are continuing to meet the uptime targets.

Type:                   Target:   Reality:   Target Met:
Critical services:      99.50%    99.97%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    97.46%        Yes
Overall:                98.00%    99.16%        Yes

For details on each service as well as average response times, see

Detailed Contractor Reporting

* Daniel Gruno:
 Work done since past report:

 - Cleared 30 JIRA tickets. See those for additional details.
 - Helped introduce Chris to his new job. This included setting up his account,
   putting it into the correct staff groups, assigning some easy JIRA tickets
   to get started with and walking him through the process of resolving these
 - Fixed some mailing lists mistakenly marked as private. This seems to be a
   recurring problem, so we will need to tighten our mlreq page and make it
   harder to create a private list.
 - Created new mailing lists for Reef.
 - On-call duties.
 - Started work on the Infrastructure presentation for ApacheCon EU.
 - Discussed doing a "Git at the ASF" talk with David at ApacheCon, as we have
   a free slot.
 - Dealt with Freenode's security breach (mainly rerouting some IRC services to
   the EU and resetting passwords).
 - Started work on resolving the current issues faced by non-member staffers.
   This will likely take some time to finish, and involve several people. Our
   first priority should be getting a new ACL set up for browsing the mail
   archives, so root has access to this data. This is a sensitive operation, but
   one that should be well covered by the confidentiality clause in staffers'
 - ELK stack is progressing, storage setup expected to be done this week, at
   which point we will be able to start pointing some of the heavier services
   to it.
 - Answered queries from Joe Brockmeier re Hadoop moving to Git and the new
   status page.
 - Helped EVP and fundraising with the new Bitcoin donation methods (and
   answered queries on that).
 - Added commit comment integration with GitHub. This is still a work in
   progress, and I plan to rewrite the entire integration system when time
 - Moved some VMs around in response to prolonged downtime on Erebus due to
   disk replacements. This resulted in minimal downtime for services (a few
   seconds at most).
 - Finished work on the subscription service for our monitoring of local
   project VMs.

* Gavin McDonald:
 Work done since last report:

 - 68 Jira tickets closed.
 - 7 Jira tickets closed were Hardware related repairs - Disks, PSU and Memory
   The hardware situation is much better. Still a couple to resolve.
 - More Jenkins work done, Ansible issues determined but the slaves are
   unreliable at present. Still have no OOB access to them and 9 times out of
   10 if a reboot is needed the slave doesn't come back. This situation is
   only tolerable for a certain period and that is nearly up. David is in
   talks to get more slaves available.
 - Buildbot has been worked on some more, it got left behind due to other work
   but is now getting some love once more. There is one major nag in that some
   slaves (and seem to be only the new ones) are failing randomly with xml
   corruption failures even though the checkout performs fine. Testing shows
   that the xml isn't being returned (but only some of the time).
 - There are plans to upgrade Buildbot Master this month.
 - There are plans to upgrade Confluence (across several versions) this month.
 - On Call duties
 - Working through reducing cron mails

20 Aug 2014 [David Nalley]

New Karma:

RAM for VMWare host:        $690.79

Operations Action Items:

Short Term Priorities:

* Code signing
  Mark Thomas has successfully concluded his testing of Symantec Application
  signing service. Subsequent to that, he's identified a workflow that should
  work for our many projects. Conversations with Symantec on pricing are

* Response timeframe targets:
  It was agreed to set up three distinct timeframes for responding to incidents;
    1) For critical services, incidents should be responded to within 4 hours
    2) For core services, incidents should be responded to within 6 hours
    3) For standard services, incidents should be responded to within 12 hours.

  The response need not be a resolution of the issue, but needs to include one
  or more of the following steps;
    1) Acknowledging the incident through internal channels (See PagerDuty
       et al)
    2) Communication of the incident to the involved/affected people in
       accordance with the new communications plan laid out by the VP.
    3) Delegation of the issue to a member of infrastructure whenever possible
    4) Tracking of the incident (method depends on the duration and gravity of
      the incident)
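
The three response timeframes can be written down as a small lookup table and
checked mechanically. The sketch below is illustrative only; the names
RESPONSE_TARGETS and response_within_target are hypothetical and do not refer
to any tool used by Infra.

```python
from datetime import datetime, timedelta

# Response targets per service tier, as agreed above.
RESPONSE_TARGETS = {
    "critical": timedelta(hours=4),
    "core": timedelta(hours=6),
    "standard": timedelta(hours=12),
}


def response_within_target(tier, reported_at, responded_at):
    """Return True if the first response arrived within the tier's target."""
    if responded_at < reported_at:
        raise ValueError("response cannot precede the report")
    return responded_at - reported_at <= RESPONSE_TARGETS[tier]


if __name__ == "__main__":
    reported = datetime(2014, 8, 1, 9, 0)
    responded = datetime(2014, 8, 1, 14, 30)  # 5.5 hours later
    for tier in RESPONSE_TARGETS:
        print(tier, response_within_target(tier, reported, responded))
```

Note that a response here means any of the steps listed above (acknowledging,
communicating, delegating or tracking), not a resolution of the incident.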

* On-call rotation:
  At the f2f meeting in Cambridge, it was decided to introduce an on-call
  rotation between contractors. Each week, a contractor will be assigned as
  being on-call, and will be responsible for either resolving, delegating or
  communicating about outages, account-, mailinglist- and tlp-creations, as well
  as planned changes, and security issues. To the extent that this is possible
  (not counting sleep), incidents must be responded to within the new target
  timeframes, as explained in the previous paragraph.

  In the time since going live with both an on-call rotation and tracking
  response time in the response timeframe, the responses have been
  dramatically faster than the service level expectations. As we build up
  staff numbers and spread geographically a bit, the expectations may change.
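
  One simple way to derive a weekly rotation is to index a list of contractors
  by ISO week number. This is a sketch under assumptions, not the scheduling
  Infra actually uses (the report elsewhere mentions PagerDuty); the function
  name and the placeholder names in the rota are hypothetical.

```python
from datetime import date


def on_call_for(day, rota):
    """Return the rota entry that is on call during the ISO week containing day."""
    if not rota:
        raise ValueError("rota must not be empty")
    week = day.isocalendar()[1]  # ISO week number, 1 to 53
    return rota[week % len(rota)]


if __name__ == "__main__":
    rota = ["contractor-a", "contractor-b", "contractor-c"]  # placeholders
    print(on_call_for(date(2014, 8, 20), rota))
```

  Because ISO week numbers restart each January, the order can repeat or skip
  an entry at the year boundary; an explicit schedule avoids that.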

* Improved response for and analysis of Java services at the ASF: At the
  previously mentioned f2f meeting, contractors were introduced to a detailed
  course of analysing and reporting incidents with Java applications run by
  infrastructure. We expect this new information to be extremely valuable in
  reaching and maintaining the target uptime for Java services. The staffers
  would like to extend a very big thank-you to Rainer Jung for his services in
  this matter.

Long Range Priorities:

* Monitoring
  ** Uptime monitoring and responsibility:
     In addition to the previous board report, a new service level agreement was
     made between infrastructure staffers, increasing the targets for uptime on
     critical and core level services, as described below in the statistics
     paragraph. Ensuring that services meet the new targets has been made one
     of the cornerstones of infrastructure's work. The monitoring of public
     facing services has been outsourced to a third party (free of charge), and
     we will be focusing on having Circonus produce metrics for our inwards
     facing services/devices, such as LDAP, PubSubs, SNMP etc.

  ** Unified logging:
     Experiments with unified logging are proceeding as planned, with more and
     more hosts being coupled into the new logging system. A filtering
     mechanism for the lucene-based backend has been created, allowing anyone
     to use the logging service based on their LDAP credentials. As such,
     anyone with access to a specific host (as defined in LDAP) will be able to
     pull logs from the unified logging system. We are confident that this will
     make debugging and analysis easier, to the point that we are disabling
     older alerting/information systems and using the logging system to fetch
     information that would previously have been sent via email to root@.
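
     The access rule described above (a user may read logs only from hosts
     they can access) can be sketched as a simple filter. This is not the
     filtering mechanism that was built; the function name, the record layout
     and the in-memory host_access mapping, which stands in for a real LDAP
     lookup, are all assumptions.

```python
def visible_logs(records, user, host_access):
    """Return only the log records from hosts the user is allowed to access.

    host_access maps a user name to a set of host names; in a real system
    this would come from LDAP rather than a dictionary.
    """
    allowed = host_access.get(user, set())
    return [record for record in records if record["host"] in allowed]


if __name__ == "__main__":
    # Hypothetical hosts, users and messages.
    records = [
        {"host": "host-a", "message": "disk array degraded"},
        {"host": "host-b", "message": "cron job failed"},
    ]
    host_access = {"alice": {"host-a"}}
    for record in visible_logs(records, "alice", host_access):
        print(record["host"], record["message"])
```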

* More virtualisation; Better use of what resources we have:
  It was decided to move towards more use of virtualisation for many critical
  and core services, including our main web sites and wikis. This will allow
  us to better respond to incidents and resolve them without affecting other
  services. Furthermore, it is our belief that we can free up resources by
  switching to a virtualised environment, thereby possibly getting more
  space for the crammed-up project/service VMs.

* Automation
  ** Cloud-based dynamic build slaves have been in progress for a bit. Much of
     the work around this has been driven by Dan Norris. Building on a framework
     of repeatable builds he and Gavin McDonald have been successfully spinning
     up on-demand build slaves with our RackSpace account. This also relates to
     our goals around configuration management, and configuration of the machine
     is in Puppet. Expect to see this service go into production in the next
     week or so.

  ** Puppet - the scope of puppet deployment continues to edge forward. Gavin
     McDonald attended training just before the Infra F2F meeting.

* Technical Debt and Resiliency Work around uptime monitoring and actually being
  able to better understand where the pain points are, coupled with some of the
  knowledge we gained at the Infra F2F has allowed us to focus on long term
  adjustments rather than hasty short term restoration of service. You should
  see this reflected in the uptime statistics

General Activity:
- A face-to-face meeting between infrastructure members was held in Cambridge,
- 27 new committer accounts created, 8 new mailing lists (TBC)
- 3 projects were promoted to TLP
- 193 JIRA tickets resolved (since last report)
- A new status site was launched by Infra at
- PagerDuty has donated a gratis account for up to 10 users to ASF Infra

Uptime Statistics:
Due to a new SLA between contractors, the targets for critical and core services
have been updated to reflect the new criteria (99.50% for critical and 99.00%
for core respectively). This represents an overall increase of 0.57% uptime
across the board. These figures span the previous reporting cycle as well as the
present reporting cycle (weeks 29-33)

Type:                   Target:   Reality:   Target Met:
Critical services:      99.50%    99.94%        Yes
Core services:          99.00%    99.81%        Yes
Standard services:      95.00%    96.83%        Yes
Overall:                98.00%    98.99%        Yes

For details on each service as well as average response times, see

Detailed Contractor Reporting

* Daniel Gruno

  - Resolved 51 JIRA tickets
  - Worked on a new status site for public ASF services
  - Worked on the ELK stack, set up on phanes/chaos.
  - On-call duties
  - Worked with Gavin and Tony on OpenSSL CVEs and general VM upgrades
  - Fixed issues with svn2gitupdate not working
  - Worked around a GitHub API change that had invalidated our integration
  - Miscellaneous upgrades to ASFBot
  - Continued work with uptime monitoring and reporting
  - Worked with Fundraising to produce statistics about the ASF
  - 8 days of vacation.

* Tony Stevenson

  - Resolved 74 issues
  - Took part in the bugbash
  - On-Call rotation
  - FreeBSD/Ubuntu SSL CVEs.
  - Started work to disable swap across all VMs
  - Investigations into BigIP/F5 etc
  - Fixed issues with VPN appliance
  - Several disk replacements
  - Ran down several repeat cron error messages
  - Some further conversations with others about puppet
  - Set up a trial of LastPass with a view to possibly replacing our GPG files.
  - Instigated the trial of HipChat, with a view to seeing if we could deprecate

* Tony Stevenson - Comments


  For a long time I have been thinking about trying to find a better way to
  engage with some of our users. Also, I was hoping to find a way that we could
  get a better feed of information that was more relevant and pertinent.

  We have hooked it up to JIRA, Github, Pagerduty, and PingMyBox. These all
  provide near realtime information that we can act on.

  The more modern service will perhaps be seen as a move on from some of our
  older roots, which might appeal to others. With the move we have also seen the
  SNR improve significantly, enabling better communication across the team.

  The alerting with HipChat allows people to be notified of communications that
  involve them, via push messages to a phone/tablet etc. Also once you return
  online from an offline state you see all the history. The history is

  You can join us, here,

  Private channels can also be created for those who need to control access to
  a channel. Think #asfmembers etc.

* Dan Norris

  - Built machine image automation using Packer
  - Packaged (using FPM) many of the unpackaged build tools (Provides
    repeatable, known installation; allows us to query for status and version)
  - Deployed a DEB repository in RAX CloudFiles for packages
  - Puppetized the build slave configuration
  - Documented the process of building a machine image, uploading it to RAX
  - Using jclouds plugin for Jenkins, successfully provisioned dynamic build

16 Jul 2014 [David Nalley]

New Karma:

Dan Norris (dnorris)

* 64GB RAM for Arcas:       $777.16

Operations Action Items:

Short Term Priorities:

* Signed binaries
  Mark Thomas has made significant progress in his efforts with Symantec around
  using their Binary Signing as a Service product. I have high hopes that we are
  near a proposed solution.

* builds.a.o
  Much work has been done around builds.a.o; and it's largely stabilized. The
  past month has yielded 99.92% uptime. That's a far cry from the routine
  outages that were happening on average once a day.

Long Range Priorities:

* Monitoring
  ** Uptime monitoring and reporting:
     As an extension to defining core services and service uptime targets,
     Daniel has begun compiling weekly reports of uptime for most of the
     publicly facing services. These reports will in turn be compressed into
     monthly reports for the board as well as a yearly report detailing the
     overall uptime reality vs our set targets. Eventually, these reports will
     also feature inward facing services.

  ** Unified logging:
     Discussion and exploration have begun on unifying logging on all VMs and
     machines. The logging will be tied to puppet and allow for easy access to
     each host's logs from a centralized logging database, as well as allow for
     cross-referencing data. Initial exploration into using LogStash with
     ElasticSearch and Kibana has begun, and is expected to produce findings
     for use in the next board report.
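
The roll-up of weekly uptime reports into monthly ones, described under
uptime monitoring above, can be sketched as follows. This is a minimal
illustration that assumes each weekly report records minutes of downtime per
service; the report does not say how the real figures are computed, and the
sample numbers are made up.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def uptime_percent(downtime_minutes, period_minutes=MINUTES_PER_WEEK):
    """Uptime as a percentage of the period, given minutes of downtime."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def combine(weekly_downtime_minutes):
    """Roll several weekly downtime totals into one figure for the whole span."""
    total_period = MINUTES_PER_WEEK * len(weekly_downtime_minutes)
    return uptime_percent(sum(weekly_downtime_minutes), total_period)


# Hypothetical downtime minutes per week for a single service.
weeks = [2, 0, 30, 8]
for number, minutes in enumerate(weeks, start=1):
    print(f"week {number}: {uptime_percent(minutes):.2f}%")
print(f"4-week span: {combine(weeks):.2f}%")
```

Summing downtime before dividing, rather than averaging the weekly
percentages, keeps the result correct even if the periods being combined are
of different lengths.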

* Automation
  Tony has expended effort and time in deploying a more up-to-date,
  platform-agnostic base for puppet. Giridharan Kesavan and Gavin have been
  experimenting with using Ansible for build slave automation. Dan Norris
  has begun work on automating VM/cloud provisioning.

* Technical debt
  Gavin began addressing cruft in many of our automated jobs; this will be a
  long term effort, but that work is underway and already yielding benefits.

  In some ways, we are just beginning to collect information to let us know
  where we stand, and exactly how much debt we have accrued. We have started
  comparing the uptime reports with our first pass at service level
  expectations.

* Resiliency
  Our efforts around resiliency are still nascent. We have begun to address a
  few issues caused by resource constraints, though this is a very minor attempt
  to provide true resilience. As other efforts in our long term priorities take
  shape, I expect that we'll begin to see this accelerate.

General Activity:

* Dealt with yet another batch of OpenSSL CVEs affecting all hosts.
* Upgraded Arcas (JIRA host) with 64GB RAM to deal with slow response times.
  This has greatly reduced the response time from Jira. See screenshot detailing
  that change:
* Welcomed a new contractor, Dan Norris, to the fold.
* Face-to-face meeting in Cambridge between infrastructure people
* Created 23 new committer accounts, 4 new mailing lists

Uptime Statistics:
These figures currently span weeks 27 and 28 of this year, and only cover public
facing services.

Type:                   Target:   Reality:   Target Met:
Critical services:       99.00%    99.98%        Yes
Core services:           98.00%    99.84%        Yes
Standard services:       95.00%    92.71%        No[1]
Overall:                 97.43%    97.80%        Yes

For details on each service as well as average response times, see

[1] The target for standard services was not met due to our Sonar instance
    being unstable at the moment and only having around 50% uptime. We are
    investigating the issue.

Contractor detail:

* Gavin McDonald

Short term Jobs worked on this week:


Jira tickets worked on [12] See: jql query 'project = INFRA AND updatedDate >=
"2014/06/16" AND updatedDate <= "2014/06/22" AND assignee was ipv6guru ORDER
BY updated DESC'

Jira Tickets Closed [10] See: jql query 'project = INFRA AND resolutiondate >=
"2014/06/16" AND resolutiondate <= "2014/06/22" AND assignee was ipv6guru
ORDER BY updated DESC'

My Open Jira Tickets [34] See: jql query 'project = INFRA AND status != Closed
AND assignee was ipv6guru ORDER BY updated DESC'
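
The weekly queries above differ only in the date field and the date range, so
they can be generated rather than typed by hand. The helper below is a
hypothetical sketch (the function name and parameters are not part of any
Jira tooling); it only builds the query string and does not contact Jira.

```python
def jql_for_week(assignee, start, end, date_field="updatedDate"):
    """Build a date-ranged JQL string with consistent double quoting."""
    return (
        f'project = INFRA AND {date_field} >= "{start}" '
        f'AND {date_field} <= "{end}" '
        f"AND assignee was {assignee} ORDER BY updated DESC"
    )


# Tickets worked on, then tickets closed, for the week quoted above.
print(jql_for_week("ipv6guru", "2014/06/16", "2014/06/22"))
print(jql_for_week("ipv6guru", "2014/06/16", "2014/06/22",
                   date_field="resolutiondate"))
```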

Commits made Infra repo: 22

June 16th saw planned downtime at OSUOSL. The downtime window was 2 hours,
between 11am and 1pm UTC. Daniel Gruno and I both covered this outage window,
plus at least 2 hours before and after it. The actual downtime we saw was
2 minutes, at 11:55am.

Ongoing answering of queries on the infra@ and build@ mailing lists, including
quick resolutions to issues raised. The same goes for IRC - Channels open at
time of writing are:
  #sling #asfboard #jclouds #asfmembers #asftac @#abdera #avro #osuosl
  #+#buildbot #asftest @#asfinfra

Worked on various buildslaves of both Buildbot and Jenkins, updating,
upgrading, patching for SSL etc. Worked on upgrading SSL for several other
VMs, at the same time taking the time and opportunity to
update/upgrade/dist-upgrade and reboot.

Ongoing Medium Term Jobs:

  Dell Warranty Renewals.

   Involves liaising with various Dell reps via email. Service Tags have been
   99% brought up to date and documented in the service-tags.txt file. Make
   decisions on warranty renewals based on age and whether it is in our plan to
   renew the machine within the next 9 months. Get quotes for and give the go
   ahead to Dell for those we intend to renew. The current email noise from Dell
   regarding these is quite high so this is a task I intend to complete over the
   next few weeks - to either renew, or decline and stop the renewal emails.

  Root Cron Job Emails.

  Involves sifting through root@ cron emails from various machines and VMs.
  Determine the current important ones that can be assessed and fixed to
  completion. Previously, this was just 'done' and perhaps followed up with an
  email reply to the cron job in question. For better visibility and reporting,
  I have now started creating Jira tickets for these tasks, and have also given
  these tickets the 'Cron' label. I expect to make steady progress and have the
  cron mails at least halved over the next 3 months.

  See JQL Query: 'project = Infrastructure and labels = Cron'

Confluence Wiki.

Confluence needs an upgrade. Test instance is in progress. I hope to have this
done in the next couple of weeks.

Ongoing Longer Term Jobs:


Some time has been spent improving the stability of the Jenkins server and its
slaves. Thanks mainly to Andrew Bayer, the server has recently improved
dramatically. The slaves have seen improvement in stability and uptime too,
including the 2 Windows machines. I have spent a fair bit of time recently on
these. I need to create new FreeBSD and Solaris slaves for Jenkins. The former
I think we can achieve in the cloud, whilst the latter I don't think is
supported at Rackspace; I am investigating. We might need to create our own VM
image for it. At the time of writing, 34 builds are in the Jenkins queue, mostly
attributed to these two missing slave OS flavours and also Hadoop jobs.

Buildbot stability is just about back to normal after I rebuilt the Master from
scratch on a new OS, FreeBSD 10 (prepped by Tony). The forced upgrade of the
Buildbot Master version itself also caused some instability for a while due to
the configuration upgrades required. This affected just about all projects
using Buildbot and the CMS. I note that the Subversion project has indicated
that a mail should have been sent to the Subversion PMC about the downtime
suffered by the Subversion project as a result of the code changes required by
the forced upgrade. Following this advice, I'd have had to email another 30+
PMCs telling them the same thing. I find that my generic email to the infra
list should have been enough information for all parties concerned.

  Cloud for Builds.

  Rackspace - A test machine has been created. Jenkins has yet to make use of
  this, however, and I'm in the process of working out the best way to integrate
  it with our systems - do we use LDAP, Puppet etc with it or create a custom
  image we can replicate. I'll also be starting work soon on an on-demand
  Buildbot test instance.

  Microsoft Azure - A test machine with Windows Server 2012 is up and running
  and I have access. I am in the process of making changes to this image to make
  a baseline so that the Azure team can replicate several more once I have it
  right. Once done for Jenkins I'll do the same for Buildbot, and make sure to
  leave 2 or 3 instances available for general project use, which I'll advertise
  as available once ready.

  Puppet - Have completed the online pre-training puppet course as advised by
  David, using a Vagrant instance via VirtualBox. I continue to invest a couple
  of hours a week in looking through the Puppet Labs online material and
  documentation. I continue to investigate the best methods of integrating the
  Jenkins and Buildbot slaves with Puppet, though I'm really in a holding
  pattern until our puppet master is upgraded to v3.

* Tony Stevenson

Took two weeks of vacation

After spending a considerable amount of time trying to build a new Puppet 3
master on a FreeBSD box, I found that it did not pan out: far too many little
changes from a standard deployment were needed, and we were still having SSL
issues with puppetdb.

A new Ubuntu VM has been built as the new puppet master and is now about done.
One more test to run tomorrow.

Spent a little bit of time on-boarding jake into root@ activities (a/c
creation etc).

Issues with Erebus VMware host. Needed a reinstall of the vsphere agent and
reconnecting to the management console.

New infra-puppet GitHub repo

* Daniel Gruno

Work log for Week 27:
  - Create mailing lists for new and existing podlings
  - Access to metis+eris for jake.
  - Set up svnpubsub/cms for new podlings
  - Evaluate ELK stack (ElasticSearch, Logstash + Kibana)
  - Work on factoid features for IRC
  - Ordered 64GB RAM for Arcas (8x8GB, replacing 7x4GB)
  - Upgraded hardware on Arcas (JIRA host)
  - Set up dist areas (some requests proved invalid)
  - Investigate and fix database issues with ASF Blogs (twice)
  - Monitor and compile uptime records for core services over the
    last week.

   (The majority of my time was spent evaluating and tailoring the ELK
    stack, as well as the math fun with semi-automating uptime reports.)

Work log for Weeks 25 and 26 (sans JIRA tickets):
  - Updated ASFBot with some minor bugfixes and feature additions
  - Worked on Git mirroring between ASF and GitHub (aka svn2gitupdate)
  - Assisted in applying web server updates for projects
  - Design discussions with Jan and Gavin about Circonus monitoring (still
    ongoing, awaiting results of initial test)
  - Discussed GitHub PR usage with the Usergrid project
  - Investigated and solved an issue with JIRA not responding
  - Worked on updating OpenSSL on all affected machines (CVE-2014-0224 et al,
    ~95% done, should be done by the end of this week (ceteris paribus))
  - Worked on an issue with nyx-ssl and puppet (still unresolved)
  - Worked with Gavin to monitor and respond to OSUOSL network upgrades.
  - Helped projects tweak settings for IRC relaying of commits/JIRAs
  - Worked on anti-spam measures for under infra's
  - Worked with Dave to resolve the blogs 404 issue. Resolved in week 27 by
    Brett Porter.

18 Jun 2014 [David Nalley]

New Karma:
* Jake Farrell (jfarrell) was added to root
* Andrew Bayer (abayer) was added to infrastructure-interest


Discovered a past due bill from Dell based on Justin Erenkrantz getting
collection phone calls. ~$1300

Placed order for 2 servers, totalling nearly $17,000

With help from EA, arranged for travel for a F2F as well as
travel for contractor training; thus far that has cost ~$9193

Operations Action Items:

None at the moment

Short Term Priorities:

* OSU Hardware failures Work continues on hardware failures at OSUOSL -
replacement hardware has been ordered and shipped, work continues on getting
it swapped in while minimizing outages.

* Outage remediation Much work continues from the action items drawn from the

* Builds.a.o We've received a good deal of help from the Jenkins community in
finding and dealing with issues.

Long Range Priorities:

* Monitoring Circonus has now replaced Nagios as our monitoring system with
lots of help from Jan Iversen. While we still have a very long way to go, the
system is already proving useful, having alerted us to a number of issues.

* Automation Slow progress continues on rolling out configuration management
in efforts to make our infrastructure better documented and more easily

* Technical Debt We have begun publishing/discussing early drafts of documents
around expected service levels as well as a communications plan; which are
very early steps in beginning to prioritize work around our technical debt.

* Resiliency Discussions around resiliency have started; but are still

General Activity:

In the month of May Infra had 194 tickets opened, and closed 158 tickets in
Jira. For the month of May, Jake Farrell closed the largest number of tickets
with 56.

highlights include:

* Dealt with emerging DMARC issue and blogged about it at

* Rewrote our qmail/ezmlm runbook documentation to bring it up to date.

* Raised potential UCB issues with our current organizational usage of
committers@. See INFRA-7594 for background.

* Dealt with a crop of openssl-related security advisories.

* Dealt with two as-of-yet unpublished security vulnerabilities.

* Published a blog entry on the mail outage postmortem:

* Confluence was patched after advance warning from Atlassian before they went
public with a security vulnerability.

* During the course of compiling an inventory for Virtual and adding in our
cost to purchase those units, we discovered that 5 machines were not
in our inventory[1]. Three of those machines were either unutilized or
underutilized. This will likely reduce some of our expected hardware spend as
they were relatively recent purchases.

* Enabled emails sent from committer addresses (or any addresses in
LDAP) to bypass moderation across all mailing lists.  No changes to
SPF records for the foreseeable future.

21 May 2014 [David Nalley]

New Karma:

Andrew Bayer (abayer) was granted jenkins admin karma.


Infra spent or authorized to spend almost $3300 thus far in the new fiscal
year; all related to replacement hardware or service for hardware.

Operations Action Items:


Short Term Priorities:

* OSU Hardware failures We have a number of hosts that have degraded or dead
hardware in our Oregon colo. This is a mixture of machines that are in and out
of warranty, and involves machines that host both core services and less
important ones. Status is being tracked at:

* Outage recovery Coming out of our outages we have a substantial number of
remediation items. In some cases the service has been restored but is not back
to pre-outage levels of operation.

* Builds.a.o Stabilization of Jenkins is a primary concern. Much work has
happened from volunteers and contractors alike (see comments below in general
activity as to improvement.) We are still suffering from service failures
every couple of days at this point.

Long Range Priorities:

* Monitoring Our monitoring still lacks the level of insight to provide
operationally significant information. Work continues on this front. Our new
monitoring system (Circonus) should come online in the next few weeks; but
much remains to be instrumented for it to be truly useful.

* Automation Slow progress continues on rolling out configuration management
in efforts to make our infrastructure better documented and more easily

* Technical Debt Work is ongoing to prioritize services infrastructure
provides and to set expectations and service levels around the services.

* Resiliency I wish that I could say that much work has occurred here; but
most of the month has been focused on outage recovery. The beginnings of that
work have taken place in working to restore a stable platform.
(see the note around hardware at OSU)

General Activity:

Infrastructure suffered three major outages in this reporting period. The
first involved the Buildbot host and a disk failure. CMS and Buildbot project
builds were down for several days while the machine was rebuilt. The second
outage was the blogs.a.o service. You can see the details and remediation
steps that are being taken here: The
third was a 4 day outage of our mail services. You can see the results of the
post-mortem here: As of this writing
there is still a significant backlog of email being processed.  At current
rate, we expect the backlog to be cleared by May 16th.

The Buildbot host aegis lost a disk also and the machine was rebuilt over a
few days, changing from Ubuntu to FreeBSD 10. The CMS and project builds were
down whilst this happened. At the same time the Buildbot Master version was
upgraded to the latest release which caused some tweaks to the code and
project config files.

Infra has noted an increased level of concern regarding the CI Systems and in
particular the Jenkins side of builds. Some projects are concerned about the
level of support that Infra gives these systems. A combination of factors over
the last months has seen a decline in support: other higher priority services
taking up time, a decline in volunteer time, an increase in projects using the
systems and, in parallel, an increase in build complexity, all making for a
decline in available resources because slave capacity has not been scaled up
to match. All this is being resolved as we speak and improvements
are being made, and there are many plans for the short, medium and long
term. The work done already is showing progress. For example, on 2 May the
average load time for builds.a.o was 72.69 seconds and the average number of
builds in the queue was 65. On 13 May the average load time is down to 1.86
seconds and the average number of jobs in the queue is less than 4. Much work
remains to be done. For data see:

Plans to address existing issues:

As has been noted infra ran into a number of problems bringing a number of
key services back into use. There are a number of planned steps that are
either remedial or work around building a more robust foundation. All of
these tie back into the long term priorities you see above.

Below are things I have requested of one or more contractors:

* SLAs - We're dividing up services into various criteria. Failures happen,
but our level and rapidity of response as well as the degree to which we
engineer for failure must be measured against how critical the service
is. The current plan is to submit the finished work to the President for
review and discussion with an audience he deems appropriate.

* Prioritization of hardware replacement - New hardware doesn't guarantee
against an outage. However, continuing to test the mean time between
failure for underlying hardware tends to increase risk on average. Along
with this prioritization; I've asked that each of the services being
replaced be done by a person who isn't the 'primary' for that service.
That list is not yet complete, but is being worked on.

* Documentation - Currently the quality varies from service to service.
Some of our documentation is clearly out of date, some is decent. My
experience is that most documentation suffers from bitrot in any
organization. However, I've asked multiple folks to bring our docs
for various services up to a usable state. I've also asked
folks other than those who produced or will be producing the documentation
to review the documentation and use it to ensure it is accurate and adequate.

* Backups - In general, our backups, where they've been happening, have been
sufficient. We've already had work around documenting restoration from backup
get committed to our docs in SVN. Additionally short term tasks have been
handed out about establishing, verifying, or restoring backups as well as
checking that against the services and documentation. There are also tasks in
place to work on speeding up our restore timelines.

* Automation - We possess a lot of operational automation (scripts and other
tools that allow us to create or subscribe to lists, create users, etc.) We
have bits and pieces of infrastructure automation - but it's not widespread.
In the three outages we've experienced catastrophic failure of the hardware
resulting in the need to rebuild the service from scratch. Virtually all of
the moving pieces involved manual processes from OS installation to service
configuration. That dramatically increased our time to recovery; as well as
being prone to user error. To that end; I've requested the following:

 - Consolidate the number of platforms we support for core services. We
   currently have Solaris, Ubuntu, two major versions of FreeBSD. I've
   asked for a single version of Ubuntu and a single version of FreeBSD
   to be adopted across all of our non-build and non-PMC infrastructure.

 - Deploy an automated OS installation tool -  During the mail outage we
   had to get smarthands in the datacenter to burn an OS install DVD and
   deploy a fresh operating system twice. This meant that a ten minute
   task turned into more than an hour in each case. I've set the criteria
   that we be able to deploy our installs over the network and control
   booting and other functions via an out-of-band management tool such
   as IPMI. We must also be able to host our own package repositories.

 - Configuration management -  We currently have puppet deployed but it
   isn't widely used within our infrastructure. Puppet permits you to
   declare state in its domain specific language that controls how a
   machine is configured and what software is installed, as well as collect
   data on the machine itself. Puppet also enforces state; and this
   enforcement is, quite frankly, better than documentation. Even if a
   machine is completely destroyed, by having done the work in puppet we
   know the exact state of the machine and can deploy that exact
   configuration back to a new machine in a matter of moments. To that end
   I have planned the following items:

       - Training - Most of the infra contractors have not used puppet in
         anger. Beginning in the next few weeks; they'll make use of some
         gratis online training from Puppetlabs with plans for attending
         a hands-on class within 6 weeks. (budget-willing)

       - Mandatory use for new services. I've asked that all new work and
         services being stood up must be done using puppet.

       - Service restoration. For core services that have failed recently,
         we've either updated documentation or have tasks to do so. I've
         requested tasks for translating that into puppet manifests to
         dramatically reduce our mean time to recovery. For services that
         will move to new hardware; if that involves the recreation of the
         service I've asked that be done via config management as well.

       - Base OS deployment. The base OS deployment at the ASF is very
         well documented. In the case of FreeBSD it's ~26 individual
         manual steps that must be executed every time. In conjunction
         with work on an automated OS install; I've asked that all of the
         base OS deployment and configuration be automated via puppet.

 - Monitoring - put simply, our monitoring does not currently provide enough
   insight. For example, we did not know about the failing hardware underlying
   our mail service. According to our monitoring, things were fine. This isn't
   to say that knowing about it would have prevented the outage, but I would
   at least like the advantage of knowing about it in good time. As mentioned
   elsewhere; when smarthands were working on our equipment they noted
   that many of our servers were complaining about hardware problems.
   Monitoring is largely grunt work; knowing what to monitor for each service
   is something that the contractors can rattle off. Actually setting up
   monitoring is a large time sink. We currently have a volunteer doing a good
   chunk of work; and my plan is to temporarily (3-6 month timeframe)
   supplement that with an outside contractor who is already familiar with
   our monitoring system and puppet.
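
The declare-and-enforce model described in the configuration management item
above can be illustrated with a toy sketch. This is not Puppet code and does
not use any Puppet API; real manifests are written in Puppet's own language.
The resource names and the plan/enforce helpers are hypothetical, and the
sketch only shows why a declared state lets a destroyed machine be rebuilt
to a known configuration, and why re-running enforcement is harmless.

```python
def plan(desired, actual):
    """Return the (name, current, wanted) changes needed to reach `desired`."""
    changes = []
    for name, want in desired.items():
        have = actual.get(name)
        if have != want:
            changes.append((name, have, want))
    return changes


def enforce(desired, actual):
    """Apply the plan to `actual` in place and return what was changed."""
    changes = plan(desired, actual)
    for name, _have, want in changes:
        actual[name] = want
    return changes


# Hypothetical declared state for one host.
desired = {
    "package:openssl": "installed",
    "service:ntp": "running",
    "file:/etc/motd": "managed",
}
rebuilt_host = {}  # a freshly installed machine with nothing configured

first = enforce(desired, rebuilt_host)
second = enforce(desired, rebuilt_host)
print(len(first), "changes on first run;", len(second), "on second run")
```

The second run makes no changes because the host already matches the
declared state, which is the property that makes enforcement safe to repeat.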

None of the above prevents failure. It might give us an edge in detecting that
a failure is about to occur, or permit us to drastically reduce our time to
recovery; but it does not actually keep bad things from happening. The longer
term piece of this puzzle is to begin engineering our most important services
to be more redundant or more fault tolerant. Most of our services are not
set up this way. Our first target is going to be the mail service; we are doing
this for two reasons. First our experience with the mail backlog and the hoops
we had to clear to empty that backlog suggest that we aren't very far from the
limits of our current architecture. Second, as you've seen in the past few
weeks, a mail outage is absolutely crippling for the Foundation.

That said, please understand that the problems we have are not going to be
solved in the short term. By the end of the quarter I hope to be able to
report that we have a good start on these initiatives, but this is a long term
effort. Unless luck intervenes it's almost inevitable that we'll suffer
another outage this year.  Hopefully we'll be in a better place to respond to
those outages as we go forward.

16 Apr 2014 [Sam Ruby]

New Karma:

No purchases/renewals for the month since last report.

Operations Action Items:

Short Term Priorities:

* Look into mac build slaves.

* Converge on migration to eris. (Step 1 is merge git ->
 git-wip on tyr) (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken
 place.
* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.

Long Range Priorities:

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.

General Activity:

* New 3-year wildcard SSL cert purchased and installed for *.openoffice

* Thrift migrated to CMS with the aim of providing better support for similar
 sites.  Blog entry here:

* The number of Confluence administrators was significantly reduced, to keep
 the list as small as possible.  Historically this permission
 level was required to operate and manage the autoexport plugin which has
 since been deprecated. see

* An inter-project communication site was requested by the community at
 ApacheCon and is being looked into by infra. This will essentially be an
 aggregator of project development wishes/requests, and will most likely
 reside on wishlist.a.o.

* As a way of lowering the bar for and securing security reports, infra is
 looking into creating a system which, based on LDAP, accepts and encrypts
 security reports for projects. The exact setup and nature of this system is
 being discussed, primarily with members of the subversion PMC.

* Heartbleed happened: see

* Two members of the infrastructure team attended ApacheCon NA 2014 and had a
 few community sessions with committers to hear their concerns and attempt to
 address them.  Also met with CloudStack members to discuss their widely
 publicized proposal for additional infrastructure needs surrounding project
19 Mar 2014 [Sam Ruby]

New Karma:
mdrob added to infra-interest.


Operations Action Items:

Short Term Priorities:

* Look into mac build slaves.

* Converge on migration to eris. (Step 1 is merge git -> git-wip on tyr)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.

Long Range Priorities:

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.

General Activity:

* The new GitHub features have been well received, with 28 projects already
 onboard with the new features in February alone. As a result, the number of
 GitHub-related messages on the public ASF mailing lists has risen from 304
 in January to 3,616 in February, with expectations to exceed 5,000 in
 March. There has been a discussion on whether to transition from opt-in to
 opt-out on these features, but for the time being, it remains opt-in.

* Instituted a weekly cron to inform private@cordova about the current list of
 committers not on the PMC, which should be the empty set.  Currently about a
 third of the pmc is impacted with no indication that this will ever be
 addressed by the chair- the requisite notices have already been sent to

* Discussed the current state of affairs with our build farms as they relate to
 TrafficServer's needs.  We intend to address this with increased funding in
 next year's budget.

* Received a report about several compromised webpages hosted by VMs
 associated with OFBiz.  In the process of working with the PMC to correct
 this situation.
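
The weekly Cordova check described above amounts to a set difference
between the committer roster and the PMC roster. A minimal sketch follows;
the function names, the sample user IDs, and the source of the rosters are
all assumptions for illustration, not the actual cron job.

```python
def committers_not_on_pmc(committers, pmc_members):
    """Return committer IDs missing from the PMC roster, sorted for stable output."""
    return sorted(set(committers) - set(pmc_members))


def format_notice(project, missing):
    """Build the plain-text body of the weekly notice."""
    if not missing:
        return f"{project}: all committers are on the PMC."
    lines = [f"{project}: {len(missing)} committer(s) not on the PMC:"]
    lines.extend(f"  - {uid}" for uid in missing)
    return "\n".join(lines)


# Hypothetical rosters; where the real lists come from is not stated in the report.
committers = ["alice", "bob", "carol"]
pmc_members = ["alice", "carol"]

missing = committers_not_on_pmc(committers, pmc_members)
print(format_notice("cordova", missing))
```

An empty result corresponds to the "empty set" the report says the list
should be.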

19 Feb 2014 [Sam Ruby]

New Karma:


Board Action Items:

Short Term Priorities:

* Look into mac build slaves.

* Converge on migration to eris. (Step 1 is merge git ->
 git-wip on tyr) (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.  (Support case closed, nothing useful came from it
 other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.

* Ensure all contractors are participating in on-call situations, minimally by
 requiring cell-phone notification (via SMS, twitter, etc) for all circonus

* Explore better integration with GitHub that allows us to retain the same
 information on the mailing list, so that vital discussions are recorded as
 having taken place in the right places (if it didn't happen on the ML...).

Long Range Priorities:

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.

General Activity:

* Migrated from backups of thor to eris.  Unfortunately a
 dozen commits were naturally lost in the process.  Thanks to TRACI.NET for
 providing additional bandwidth for this purpose.

* Jira: Jira is now running on Apache Tomcat 8.0.0 (rather than 7.0.x). While
 running on 8.0.x is unsupported by Atlassian, this is providing valuable
 feedback to the Tomcat community. To mitigate the risk of running an
 unsupported configuration, Jira is being monitored more closely than usual for
 any problems and there is a plan in place to roll back to 7.0.x if necessary.

* At the behest of committers, we have started working on a stronger
 implementation of GitHub services, including 'vanity plates' for all Apache
 committers on GitHub.  A method of interacting with GitHub Pull Requests and
 comments has been completed, that both interacts with the GitHub interface
 and retains all messages on the local mailing lists and JIRA instances for
 record keeping. At the time of writing, we have 367 committers on the Apache
 team on GitHub. We have made a blog entry about this at which seems to have reached many projects
 already.  Furthermore, the Incubator has been involved in the development of
 this, and are thus also aware of its existence and use cases.

* The new SSL wildcard was obtained from Thawte earlier this month, and will
 be rolled out to services very soon. Thanks to jimjag, who got the business
 end of the deal done so we could actually get the cert in before the incumbent

* All remaining SVN repos have now been upgraded to 1.8.

* Resurrected thor (mail-search) after soliciting help from SMS for on-site

* Amended release policy to provide rationale and spent time explaining the new
 section to members@.  See

* Work with Cordova on processing their historical releases to comport with

15 Jan 2014 [Sam Ruby]

New Karma:


Board Action Items:

Short Term Priorities:

* Look into mac build slaves.

* Converge on migration to eris. (Step 1 is merge git ->
 git-wip on tyr)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken

* Look into rsync backup failures to abi. Look into clearing out a lot of room
 on abi - currently 20GB left and 20GB+ a day gets backed up.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)

Long Range Priorities:

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache

General Activity:

* Confluence: Finally got it to upgrade to 5.0.3. Database edits and
 conversions were needed to make the transition. After a few days bedding in
 it seems to be performing much better than the previous version.

* Translate.a.o: Upgraded to 2.5.1-RC1 (that is a release). Severe
 compatibility issues. Reprogrammed part of LDAP connection, to make it more
 stable (and work).

* [2nd Jan 2014] - Jenkins Master was migrated to a much needed new server.
 This also eases the pressure from Buildbot Master since the split of hosts.

* Migrated SVN repositories to newer, larger, and hopefully quicker array on
 Dec 31st. The repository upgrades will now be done in the coming weeks once
 we have seen stability in the Infra repository for at least 1 week. We will
 then likely repurpose the SSDs in the old array and add them to the new
 array for improved caching. Total downtime for the move was 1h15m as the
 prep work had been undertaken for at least 2 weeks before.

* RE: Symantec code signing service - There are a handful of internal tasks to
 complete before we can move on.

* Migration and reinstallation of continuum-ci.a.o (was vmbuild.a.o) has taken
 [Final checks are in progress before announcing its GA]

* [5th Jan 2014] - was upgraded by the roller project

* Faulty gmirror disk on eris, liaised with OSUOSL and swapped out disk.

18 Dec 2013 [Sam Ruby]

New Karma:


* Funded Daniel Gruno's attendance at EU Cloudstack conference: cost TBD.

Board Action Items:

Short Term Priorities:

* Clear the lengthy backlog of outstanding tlp-related requests.

* Repurpose the new hermes gear for use as a (jenkins?) build master as that is
 more pressing.

* Investigate the migration tooling available for conversion from VMWare to
 Cloudstack [See attachment INFRA-1].

* Look into mac build slaves.

* Migrate eris svn repos to /x2, converting everything to 1.8.

* Converge on migration to eris.

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call is being arranged.

* Look into rsync backup failures to abi.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Continue with the effort to reduce the overwhelming JIRA backlog. At the start
 of the reporting period we started with 134 open issues. We are now down to
 ~90 open issues.

* Jan Iversen has been pushing through the outstanding TLP requests for virtual
 machines.  Several projects should by now have their VM.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Needs an intermediate upgrade to 5.0.3 then to latest.
 Attempts to upgrade to 5.0.3 have failed. We opened a support case and are
 continuing to try on a test instance.

Long Range Priorities:

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

General Activity:

* Both new tlp's this month, Ambari and Marmotta, were processed within 24
 hours of board approval.

Attachment INFRA-1: Cloudstack Conference Feedback [Daniel Gruno / humbedooh]:

Attended CCC (Cloudstack Collaboration Conference) at Beurs van Berlage
in Amsterdam. Tried out Cloudstack locally with a /27 netblock, as well
as on testing platforms available at the conference. Apart from minute
errors in the UI (which I have reported), it seems to be working as
expected. Cloudstack supports LDAP integration, however this is not a
feature complete integration, and it is my view that an infra-made LDAP
implementation - with regards to _non-infra involvement_ - is preferred,
though we may elect to use it for the administration of the hosts.

Attended a talk about Apache LibCloud which seamlessly integrates with
Cloudstack for an easy programmable management of VMs via Python. This
removes the need for dealing with the rather cumbersome Cloudstack API,
and enables the possibility of creating an infra-managed site for
dealing with VMs in several ways. Should we ultimately decide on another
cloud solution, LibCloud integrates with just about every platform out
there, and so would not be affected by this to any large degree. I did
not get a chance to properly test LibCloud, so my findings in this regard
will have to be substantiated at a later date.

Cloudstack offers support for VMWare (WebSphere), Xen(cloud), and KVM,
so migrating is as much a question of "if" as of "when".
It supports using different hypervisors on different pods (a collection of
hosts), so working in tandem with a KVM or similar free hypervisor is an

Migration options (assuming we go with KVM or similar):

A) Dual hypervisor mode (use both WS and KVM, only allot new VMs on KVM?)
B) Migrate WS boxes to KVM (Qemu-KVM supports this natively with VMWare
  version 6/7 disks)

if A, then we need to use separate pods for WS and KVM.
if B, then we pull boxes offline, one by one, move the images
     to the new host and KVM can handle the images.
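
Option B could be scripted roughly as follows. This is only a sketch: the
report notes that Qemu-KVM reads VMWare version 6/7 disks natively, so the
explicit conversion to qcow2 with `qemu-img convert` is an optional extra
step assumed here, and the VM names and directories are hypothetical. The
code only builds and prints the commands; it does not run them.

```python
import shlex


def vmdk_to_qcow2_command(src_vmdk, dst_qcow2):
    """Build the qemu-img argv that converts a VMware disk image to qcow2."""
    return ["qemu-img", "convert", "-f", "vmdk", "-O", "qcow2", src_vmdk, dst_qcow2]


def plan_migration(vm_names, src_dir, dst_dir):
    """Yield (vm, argv) pairs in order, mirroring the one-by-one approach."""
    for name in vm_names:
        src = f"{src_dir}/{name}.vmdk"
        dst = f"{dst_dir}/{name}.qcow2"
        yield name, vmdk_to_qcow2_command(src, dst)


# Hypothetical VM names and paths, for illustration only.
for vm, argv in plan_migration(["vm-a", "vm-b"], "/old/images", "/new/images"):
    # Each box would be taken offline before its command is actually run.
    print(vm, "->", shlex.join(argv))
```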

Tentative proposal for future VM management:
Create one or more hosts with KVM in CS(or OS), assign a pod to the old WS
clients, use Apache LibCloud within an LDAP-authed site (TBD) where PMC
members can request, restart, get access to, and resize (to be acked by
infra) instances. Liaise with Tomaz Muraus (LibCloud), Chip
Childers (Cloudstack) etc. on the actual implementation details. This
would mean that infra's only role would be to ack the creation/resizing
of VMs and general oversight, rather than manual creation/modification
of each VM. I expect to have a mockup of what such a site could look like
ready for infra to review and discuss in mid-December, thus adding something
of value to the next board report about it.

Jake Farrell has offered to help with the CS setup, as he has experience
running this in large environments.

There have been some discussions of maybe using other management platforms
instead of CloudStack, but given that CloudStack and LibCloud are Apache
projects, it is my opinion that we are better served, support-wise, by
using software developed by the foundation, as well as the proverbial
"eating our own dog food".

20 Nov 2013 [Sam Ruby]

Discussed funding a pair of contractors to attend a Cloudstack
conference to gain additional skills- approved by VP Infra.  Only
one will actually attend.

Acquired a free license for Jira Help Desk - rollout forthcoming.

Installed wildcard SSL cert for *

In pursuit of outsourced code-signing capability for project
releases.  Negotiations have reached the NDA phase.

Migrated the bulk of our SQL infra to a centralized database server.

Discussed replenishing our Mac build infra.

Purchased a wildcard cert for *  3 years
at $475 per year.

Began holding informal weekly meetings via google hangouts.  Open
to all infra-team members.

Had a configuration regression regarding the PIG and DRILL Confluence
wikis, which allowed additional spam to reappear on those spaces.

DNS reacquired from our registrar.  Somehow it wasn’t
configured to autorenew so we lost that domain for a few days.

We are still considerably behind the curve in our Jira workload and
that is starting to inform some of the reporting at the board level.
Please be patient while we continue to ramp up with existing personnel
to support the org’s continued growth.  In response we have organized
a monthly jira walkthrough day dedicated entirely to outstanding
jira requests.  Raw jira stats show we have made significant progress
over the past month and we expect that trend to continue, with 116 opened
vs. 166 closed.

Aegis is reporting a bad disk and it needs to be replaced as the host
is seriously underperforming in its current state.

Dell has solicited a warranty renewal offering for arcas, our jira server.

We need to sort out licensing for our VMWare infra as we are currently in
a holding pattern for new VM’s until this gets resolved.

We’ve disabled the user ability to edit their profile page in confluence,
eliminating another common source of spam.

16 Oct 2013 [Sam Ruby]

An onslaught of Confluence spam required us to change
the default permission scheme to match what we've done
for the moin wiki.  We've also formally withdrawn all
support for the autoexport plugin.

Closed out the account of a deceased committer.

Received delivery at OSUOSL of the gear we ordered last
month.  Now in the process of bringing it online.

Upped the default per-file upload limit for dist/ to
200MB (from 100MB).

18 Sep 2013 [Sam Ruby]

Discussed recent scaling issues with roller and made
appropriate adjustments to the install.

Discussed picking up an SSL cert for

Dealt with disk issues in the Y! build farm.

Dealt with high CPU consumption on the moin wiki.

Dealt with a vulnerability on the analysis VM.

Dealt with disk performance issues on erebus (VMware).

Discussed creating a dedicated database server again.

Dealt with a wide brute force password guessing attempt
against our LDAP database.  About 800 users were impacted,
none of them apparently had their passwords guessed.

Replaced a bad disk in hermes (mail).

Set up an organizational account with Apple to allow devs
to put their wares in the App Store.  Cordova will be the
first guinea pig.

Ordered some new Dell gear slated for replacement of
existing hosts.  Trying a new supplier largely for cost

Trying unsuccessfully to get our free VSphere license

A contributor inadvertently included customer data in two bugzilla
attachments, and politely requested that it be removed.  This request
was initially denied based on a careful reading of the current policy.
Subsequently, the author of that policy described his intent, the
contributor provided more information as to what they were requesting
to be removed and why, and as a result, the request was implemented.
The infrastructure team plans to revisit whether or not the policy
needs to be updated.

As to the potential policy change, there is a saying in legal circles
that "Hard cases make bad laws"[1] -- as well as a saying that "Bad
law makes hard cases"[2].  To some extent, both apply here.  The
overwhelming majority of requests for deletions are for people who
want something removed from a mailing list that is widely archived and
mirrored.  Often these requests come in after a considerable period of
time has elapsed.  For these reasons, it probably is best that the
documented policy continues to set the expectations that most requests
will be denied -- and further I believe that we should be open to
granting exceptions whenever possible.

Reflecting on (a) the low frequency with which exceptions will be
granted and (b) the amount of effort it took to resolve this, perhaps the
simplest thing that could possibly work would be an addition of a
statement like the following:

Exceptions are only granted by the VP of Infrastructure; requests for
removal of items that have already been widely mirrored outside of the
ASF's control are unlikely to receive serious consideration.


21 Aug 2013 [Sam Ruby]

A report was expected, but not received

17 Jul 2013 [Sam Ruby]

Discussed the logistics of bringing the ACEU 2012 videos online.

Dealt with the fallout surrounding the recent javadoc vulnerability.

Discussed upgrading our VSphere license with VMWare.

We continue to iron out kinks with our new circonus monitoring service.

Added a brief mission statement here:

Upgraded svn on eris and harmonia to current versions.

Discussed making shell access from an opt-in service.

19 Jun 2013 [Sam Ruby]

Purchased 9 disks from Silicon Mechanics to fill out the eris array -
cost ~$1400.

Promoted Jan Iversen to the Infrastructure Team.

Had oceanus (new machine) racked in FUB.

Mark Thomas kicked off a new round of FreeBSD upgrades.

Disabled CGI support for user home directories on

Purchased a wildcard cert for * at $595/year from digicert.

Daniel Gruno was given root@ karma and will need to be added to the
committee as a result.

Set up a new VM with bytemark for circonus-based monitoring.

Work continues around the Flex Jira import problems.

Acquired several new domains for management by us instead of external

15 May 2013 [Sam Ruby]

Completed our budget deliberations including funding for a new
part-time position.

Purchased 3 new HP switches to replace our aging Dell switches.
Cost ~ $4700.

Continued discussion of code-signing certificates for our

Dealt with some failing/overloaded build machines in our Y!

Jan Iversen continued to work on our nagios -> circonus
service monitoring migration.

2 disks have failed in loki (tinderbox), we've replaced one from
inventory but will need to order more to complete the replacement.

Experienced some security / porn issues with the moin wiki and
have upgraded to the latest version to assist with controlling
the spam.

We will be disabling password-based ssh access to
in the near future, once the supporting scripts have been tested.

Rainer Jung was granted root karma and needs to be added to the
formal committee roster.

17 Apr 2013 [Sam Ruby]

A report was expected, but not received

20 Mar 2013 [Sam Ruby]

About to spend ~$1500 for additional drive capacity for eris (

Updated our inventory with Traci to better align with their power-cycling

Set up wilderness.a.o lua playground for Daniel Gruno.

Granted danielsh and gmcdonald IRC cloak granting karma.

Upgraded bugzilla to 4.2.5 in response to a vulnerability announcement.

Jira was upgraded to the latest stable (5.2.8).

Daniel Gruno set up a direct SMS service for root-users to take advantage of.

Again discussed timing issues surrounding the dissemination of authoritative
declarations about newly minted TLPs.

Setup - a pasting service for committers to use.

Still dealing with the fallout of our failed rack-1 switch.  Now pursuing
indirect support with Dell.

Received/deployed the new box for an additional vmware hosting service.

Upgraded httpd on eos and aurora(www/mod_mbox service) to 2.4.4.

Restored archiving for the EA mailbox.

Jan Iversen upgraded mediawiki for Apache OpenOffice due to an announced
vulnerability issue.

Working with concom to setup a USB disk with video data on it in one of
our OSUOSL racks.

Pruned the apmail group list down to the relevant current active folks.

20 Feb 2013 [Sam Ruby]

Placed a $15K order for a new vmware server with Dell which is now on
backorder through February.

Still dealing with the fallout of losing one of our public switch

Uli is working on OOB access for us at FUB.

Specced some additional drive capacity for eris.

Discussed setting up as a pasting service for Apache
based on Daniel Gruno's site.

Started the process of reining in the abusive maven traffic to

Started the process of dealing with the missing Flex attachments for a
Jira import.

Shut off the people -> www rsync jobs for our websites. All project sites
now MUST be on either svnpubsub or the CMS to continue to be maintained.

Enabled redirects for our services for graduated podling

Upgraded the software on adam (OSX) for $40 thanks to Sander Temme.

Was contacted by Traci to update our inventory with them.

16 Jan 2013 [Sam Ruby]

Rainer Jung and Jan Iversen have done a stellar job of
coordinating and collaborating on a new wiki host for
openoffice, among other activities from Rainer in particular.

Discussed with the secretary the best way to populate
a new tlp's LDAP groups from either the unapproved minutes
or the agenda file.  After a bit of back and forth we
settled on the agenda file for the time being, however
this effort remains a convenience for the incoming chair,
not a means of setting up the groups in a permanent and
official way.  The chair remains responsible for vetting
the groups post-setup.

Had a robust yet not entirely satisfying discussion with
the membership about relaxing ACL's for our subversion
service, culminating in the following url

It is expected to take an extended period of time before
such changes can be effectively implemented as infra policy,
but the goal of simply raising awareness has been met.

We are in the final phases of withdrawing support for rsync-backed
websites, which we expect to complete before the end of the month
rolls around.  At this time there are still several outstanding
projects who have yet to file a jira ticket with us to migrate
to either svnpubsub or the CMS, and their ability to continue
to service their live site with updates and new content will
be impacted.

Daniel Gruno and others have been working on a gitpubsub
service over on github and have rolled out a demo version
to our live writable git repos.  We expect even more coordination
between the svnpubsub service maintained by the subversion
crew and the gitpubsub service Daniel and others have worked

19 Dec 2012 [Sam Ruby]

Lost our public VLAN in our rack1 switch for undetermined reasons,
probably due to a misconfiguration on the OSUOSL side.  Will continue
followup with OSUOSL for eventual resolution.

Enabled core dumps on one of our mail-archives servers to better diagnose
the nature of the ongoing segfaults.

Specced a new VMWare host to offer additional VM's to our projects, then
haggled with each other over the config.  Now appears we're going to
repurpose chaos (36 disk enclosure) to serve up a Fiber Channel interface
to the new host.

The tlpreq scripting is now in place and ready for new graduating projects
to use.  We will pass along the details to the Incubator for podlings due
to graduate in December.

Decided we're comfortable with the OpenOffice project keeping at most 2
releases on the Apache mirror system (/dist/) at any one time.  Henk
Penning has communicated this back to the AOO PMC.

Came across a bizarre privacy information leak with Jenkins and LDAP.
We've patched our installation to mitigate the issue.

Discussed our near-term plans for git hosting on various lists.  One of
the exchanges was needlessly heated, and we have tried to rectify the
situation with better documentation and a bit less BOFH tactics.

21 Nov 2012 [Sam Ruby]

Restructured the creation of new mailing list infrastructure:
new "foo" lists will be named following this convention:


instead of the now antiquated "$".

Similarly restructured website assets for new podlings to use


instead of the prior "$podlingname/".

These changes will help make migration to TLP status easier for
both the podling and the Infrastructure Team.

Coordinated with Sally with respect to the Calxeda/Dell ARM donation.

Worked with OSUOSL to mitigate the downtime of a series of
scheduled network outages affecting our .us services.

Discussed the pros and cons of using Round-Robin DNS for websites.
No action taken moving us away from RR DNS.

Rainer Jung upgraded our webserver install on eos to the
latest and greatest version of 2.4.x.

Mark Thomas upgraded all 3 bugzilla instances in response to a security
vulnerability report.

Some generic stats detailing the org's recent growth:

 New committer intake: ~300 / year like clockwork over the past decade.

 New TLP graduations:
                       2010: ~20
                       2011: ~10
                       2012: ~30

 New INFRA issues:

 Mailing List / Subversion activity: [*]

 Average Subversion traffic (hits): consistently ~ 3.2M / day for the past
 few years.

 Inbound Mail traffic: 250-300K connections per day for the past few years.

 Average Website traffic (hits):
                        Nov 2010: 10M / day
                        Nov 2011: 11M / day
                        Nov 2012: 21M / day [**]

 Rough Download Page traffic (hits):
                        Nov 2010:  48K / day
                        Nov 2011:  46K / day
                        Nov 2012: 145K / day [***]

 46 Virtual Machines (18 new within the past year) and 24 additional ARM
 servers due to the Calxeda/Dell donation.

[*] - clicking on the mailing list graph shows incubator + hadoop + lucene
   are now responsible for 40% of the org's total mailing list traffic.

[**] - 10M / day due to

[***] - 100K / day due to openoffice (note openoffice users typically upgrade
using the openoffice software itself rather than by visiting the download webpage)
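
The effect of the footnoted contributions on the year-over-year figures
can be worked out as below. This is an illustrative calculation only: the
traffic numbers are the rounded ones quoted above, and the helper name
`growth_pct` is made up for the example.

```python
def growth_pct(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# Figures from the report, in hits per day.
website = {2010: 10_000_000, 2011: 11_000_000, 2012: 21_000_000}
downloads = {2010: 48_000, 2011: 46_000, 2012: 145_000}

# Footnoted contributions to the Nov 2012 figures.
website_footnote = 10_000_000  # [**]
downloads_footnote = 100_000   # [***], openoffice

# Raw growth from Nov 2011 to Nov 2012.
print(round(growth_pct(website[2011], website[2012])))
print(round(growth_pct(downloads[2011], downloads[2012])))

# Growth with the footnoted traffic subtracted from the 2012 figures.
print(round(growth_pct(website[2011], website[2012] - website_footnote)))
print(round(growth_pct(downloads[2011], downloads[2012] - downloads_footnote)))
```

With the footnoted traffic removed, the Nov 2012 figures are roughly level
with Nov 2011 (11M vs. 11M website hits, 45K vs. 46K download page hits).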

AI: Sam follow up with regard to git post-commit hook support by infra

17 Oct 2012 [Sam Ruby]

Expressed some concerns about the ongoing volunteer
support and documentation for nexus (

Spoke with a few PMCs about cleaning up temporary artifacts
in their website rsync ops.

Discussed infra meetup plans to coincide with Apachecon EU.

Due to the fact that the service was shut down, we
are considering other options to provide the same functionality
to our projects.

Dealt with some issues surrounding the generation of pages.

The contract for colocation service with FUB has been signed
by FUB and awaits our counter sign.

Started discussing various approaches to simplify the podling
graduation process from an infra standpoint.

19 Sep 2012 [Sam Ruby / Jim]

Daniel Gruno added to the Infrastructure Team.

Bought a pair of 1TB SATA drives, one of which was
used to replace a bad disk in hermes (mail).

Calxeda donated access to a 24-node ARM-based hosting
service to be deployed as part of our build farm offerings.

Removed the custom jira patch licensing plugin largely
because no one wanted to continue to maintain it.

Migrated jira service to its own hardware (a spare r410)
for stability reasons.

Picked up 7 more SATA drives for inventory and replacement.

VP Infra pointed out various rumblings rising to the board
level about contractor communications, and had various ideas
about how to address that.

Started work on a Circonus-based service to replace our aging
nagios installation.

AI Jim follow up with infra regarding git status

15 Aug 2012

A report was expected, but not received

25 Jul 2012 [Sam Ruby]

Updated the mailing list creation process- see

Determined that, one of our Apple Xserves,
is no longer capable of productive service due to various
hardware faults.

Discussed acquiring additional "cloud" services for our
projects to use, ostensibly thru some unspecified bidding
process.  Nothing much came of it.

Approached by Dell regarding warranty renewal for selene
and phoebe (Geronimo TCK build farm).  We declined.

More discussion, much of it less than constructive, about
providing a digital signature service to Apache project

Worked out a deal with Calxeda to provide a few ARM-based
build servers for our projects to use (at no cost to us
other than admin time).

Granted Philip Martin of the Subversion project access to
eris (US svn server), mainly for his offer to help with some
svn server debugging.

Working with our main DNS provider to stabilize our
account services to better deal with the dozens of extra domains
the AOO project needs us to host.  We've been getting gratis
service to this point, but we are willing to pay for better
responsiveness and additional features only available with a
paid support plan.

Discussed plans for an infra meetup to roughly coincide with

Work on the backup system migration from bia (in Los Angeles)
to abi (in Fort Lauderdale) is nearing completion.

Some progress was made on getting the number of outstanding
jira tickets down to normal levels.  We've reclassified tickets
based on whether they are "waiting for user" input or "waiting for
infra", which has helped, but the bulk of the open tickets still
are "waiting for infra".  We expect to continue to make progress
on this over the coming days and weeks, and will continue to report
on it until we are satisfied things have returned to an acceptable

Daniel Gruno put together a nice comments service at
for project websites to take advantage of.

20 Jun 2012 [Sam Ruby]

has returned to normal service post-upgrade; thanks to Dave
Johnson for the particulars.

Migrated our jira instance to our shiny new phanes/chaos VM cluster and it
seems to be performing rather well now.  Thanks go to Dan Kulp for debugging
the svn plugin for us, as well as culling the jira-administrators group to sane
levels (projects will need to make better use of roles).  Upcoming work to
include flushing the backlog of pending jira imports, which we've now started
with Flex.

Discussed options for "Cloud support" for a certain GSOC project. Conclusion
was that we don't currently have a suitable arrangement worked out with a cloud
provider to offer the ASF the enterprise setup that we'd require.

Did the password rotation dance again- this time there were no malicious
activities surrounding the action.  See for details.
The situation has since returned to normal now that we've reenabled committer
read access to the log archives on

Discussed java hosting futures with members of the FreeBSD community in
light of the fact that Atlassian does not consider FreeBSD a supported
platform, which at least partially motivated our move of jira from a
FreeBSD jail to an Ubuntu based VM.

imacat added to the Infrastructure Team to help support the ooo wiki and
forums platforms.

Work on new harmonia is currently underway- colocation provided by Freie
Universität Berlin (FUB).  Uli Stärk is our lead on this.

Decided not to pursue a meetup at the Surge conference this year, preferring
to get together at one of the upcoming Apachecon conferences.

Discussed support status for git and documented our plans for bringing
it to a fully supported service over the next 3-6 months, culminating
in the following awkwardly tautologous statement by VP Infra:

 "The infrastructure team has four full time contractors and a variable
 number of volunteers and is committed to supporting both git and

Experienced extended downtime for after an OS upgrade
busted our install.  Dan Dumont from Apache Shindig has been assisting
us in recovery- we expect the service to return to active status by
the time of the board meeting.

We've fallen a bit behind in our caretaking of jira issues mainly due
to a number of new graduations from the Incubator.  We'd like to return
the number of outstanding issues to "normal levels" within the next reporting
period, which seems a reasonable goal given the expected (small) number of new
graduations happening this month.

We upgraded (aka minotaur) in light of the recent security
reports from FreeBSD concerning a local root exploit.

Considerable work has gone into scripting various workflows around common
requests like mailing lists, git repos, and CMS sites.  We've subsequently
created to house these efforts once they've fully gelled
from their development versions at

Daniel Shahaf is stepping back to part-time for three months starting in July.

16 May 2012 [Sam Ruby]

Added Mohammad Nour El-Din to the Infrastructure Team.

Upgraded Jira to version 5- kinks still being ironed out.

Had OSUOSL install our recent purchases from Silicon Mechanics
and Intervision.

Coordinated with the OpenOffice PPMC and Sourceforge regarding
the distribution of AOO's 3.4.0 release.  Most of the traffic
was handled by Sourceforge's CDN instead of our mirror system,
at a rate of over 20 TB of download traffic a day.  Download
stats will be published by the AOO PPMC in the near future.

Replaced isis's bad disks and brought up 2 additional build
hosts at our Y! datacenter.

Discussed another F2F meeting during the Surge 2012 conference.
Not a lot of expressed interest so far.

Experiencing uptime issues with our Bytemark
VM ever since they migrated the VM to different hardware.

Discussing git hosting options again, and again, and again...

18 Apr 2012 [Sam Ruby]

Coordinated the installation of aurora in SARA
with Bart van der Schans.  We are now out of free space
in that datacenter.

Our new backup server abi is in service at Traci.
We've arranged for Sam to have access to the joes-local
safe deposit box in an emergency.

Worked out an informal deal with SourceForge to assist
with the delivery of OpenOffice releases.  Whether or
not this continues to be used beyond the first release
is up to the AOO PPMC.

Henk Penning is putting the final touches on providing
optional Apache-mirror support for OpenOffice releases.

Bought an array from Silicon Mechanics for about $12K.
The host to attach it to will be purchased through
Intervision in the very near future.

Updated the AOO bugzilla logins to reflect the fact that
we host the domain but no longer allow mail
for it.

Considering alternatives to directly importing Flex's jira
data into the main jira instance due to repeated failed attempts.
Atlassian refuses to assist us until we can demonstrate the
identical problem on a supported platform.

Submitted our budget for review and approval.

Harmonia's disk subsystem finally stopped performing
well enough to continue using it as our European svn slave.
Specced a replacement host with Uli Stärk's help.

Fleshed out a deal with Freie Universität Berlin for colocation
services, targeting new harmonia as the first host to deploy there.

Set a soft limit of a combined 1GB worth of release artifacts
dropped onto the mirror system- anything exceeding that figure needs
to coordinate with infrastructure in advance.
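
As an illustration only, the soft limit above could be checked with a
small helper like the following.  This is a sketch, not the tooling
infra used: the function names are hypothetical, and "1GB" is assumed
here to mean 1 GiB (1024**3 bytes).

```python
import os

# Assumption: the report's "1GB" is read as 1 GiB.
SOFT_LIMIT_BYTES = 1 * 1024**3


def total_artifact_bytes(paths):
    """Sum the on-disk sizes of the given release artifact files."""
    return sum(os.path.getsize(p) for p in paths)


def exceeds_soft_limit(sizes_in_bytes, limit=SOFT_LIMIT_BYTES):
    """Return True when the combined size is strictly over the limit."""
    return sum(sizes_in_bytes) > limit
```

A release whose combined size makes exceeds_soft_limit() return True
would be the case that needs advance coordination with infrastructure.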

21 Mar 2012 [Sam Ruby]

All outstanding bills have been paid.

Spoke with the myfaces PMC about their public-facing
maven repo on their zone. The zone has been taken down.

Discussed replacement purchase for harmonia
and an additional VMware server for Linux VMs.

Upgraded the majority of our FreeBSD servers to 9.0.
Will complete the remaining ones in the near future.
Next time round we will likely enable dtrace throughout.

Discussed infra's authority to pull improperly-signed
releases from the mirrors.  VP Infra concurs we can/should
do this if appropriate justification is available.

Mark Thomas upgraded our Bugzilla installs to 4.0.4.

Dealt with our monitoring host being redeployed to a new
VM server.

Discussed hosting a yum repository for project releases.
No decision was made at this time, pending further input
from volunteers.

Discussed deploying an Australian svn mirror to Gav's
local server.  No decision was made at this point.

Greg Stein has taken svnpubsub over to Subversion's trunk
and has done a lot of work on improving the svnwcsub service.
We look forward to seeing svnpubsub distributed in a subversion

Established the precedent of not resetting accounts for
returning committers that are no longer a part of any active

Experiencing chronic problems with isis, one of our build hosts
in Y!'s datacenter.  Y! has offered us a pair of new servers
to supplant it.

Began discussion of a budget for FY2012-2013.

Posted a few CMS-related blog entries to
describing recent activity.

15 Feb 2012 [Sam Ruby]

Still attempting to pursue a github FI instance for ASF use.

Intervision LOC app has still not been filled out and sent off.

Purchased an HP switch for SARA for 456 EU.

Renewed warranties on Dell and Sun gear for 1Y with Intervision
and Technologent, respectively.

Prepped aurora and the HP switch for installation in SARA.

Renewed DNS for another 9 years.

Silicon Mechanics reminded us of our outstanding $2037 credit
with them.

We've sent out a notice to all PMCs about the plan to migrate all sites
to svnpubsub or the CMS by the end of this year.

24 Jan 2012 [Sam Ruby / Jim]

Still attempting to pursue a github FI instance for ASF use.

Intervision LOC has still not been filled out and sent off.

We've determined there is adequate space (9U free) for us
to install our new gear in SARA.  However we are in need
of an additional switch, which is tasked to Uli Stärk for

Awaiting a final decision from the OpenOffice podling regarding
hosting of extensions and templates.  They are considering
either hosting those services locally at the ASF or with

Joe spent a few hours training Melissa (EA) on svn use.
A google calendar is now available for everyone's use
thanks to Melissa.

Secured and rolled out a wildcard cert from Thawte.

Dealt with a minor security issue in the Nexus installation,
reported by Sebastian Bazley.

Partitioned our SSL termination for our linux hosts to separate
VM's for additional security.

Began deploying puppet for eventual management of our linux hosts.

Dealt with a benign security incident on

Further enhanced CMS performance with the introduction of parallel

21 Dec 2011 [Sam Ruby / Jim]

Floated the concept of providing Sam Ruby a place to supply
a grab bag of useful CGI scripts he has developed over
the years.  Originally hosted on the server
as but is in the process of being redeployed
to a separate VM named

Rejiggered switch 2 (our "private" switch) with OSUOSL
help to better align it with our goals of a single switch
per cabinet.

Setup for dedicated service to the OpenOffice
extensions and templates services.  As these services include
providing downloads for non-open-source licensed products,
explicit permission by VP Infra for this purpose was granted.
VP Infra delegated the final decision on licensing to the
Incubator PMC once OpenOffice proposes to graduate.

Setup which is a pootle-backed service
for projects to use in their language translation efforts.

Initiated "Git Friday" to focus on git-related support issues
across the whole team on a weekly basis.

Discussed installation of new aurora into SARA with Bart van
der Schans and Wim Biemolt of SARA.  Progress is gated on
Wim getting back to us on the status of our to-be-determined
additional rack space.

Tony Stevenson put in his notice to return to paid contractor
status starting Jan 1.

Opened the floor to input on contract terms for sysadmins, in
particular the immediate part-time post(s).  General feedback
on the idea of including explicit metrics in the terms was

Wrote a testimonial for the FreeBSD Foundation regarding our
FreeBSD deployments.

Gave the CMS a big performance boost by replacing the rsync
usage with zfs clones.

Put out an RFP for participation in our ongoing alpha test
of git hosting.  7 projects responded; all 7 were accepted:
 wicket, callback, cassandra, s4, deltaspike,
 trafficserver, deltacloud.
Participating projects are required to provide an infra
volunteer to work on the git hosting code.

Got Thawte's nod for a wildcard cert via Bill Rowe's contacts
(waiting on Sam Ruby to chase up Brian the sales rep).

16 Nov 2011 [Noirin Plunkett / Jim]

Accepted a "loan" from IBM for a PPC64 server to be
incorporated into our build farm.

Welcomed the new VP Infra Sam Ruby to the leadership
post for Infra.

Changed our plans from simply replacing the bad disk
in harmonia to replacing the entire host asap.

Completed the migration of the domain
to ASF control.

Discussed mirroring options for the large set of artifacts
produced in OpenOffice releases.

Registered the domain with our dotster account.

Held a minor infra meetup at Apachecon led by Philip Gollucci
to discuss transition issues for the VP role.

Discussed a few proposed mail templates to send out regarding
our svnpubsub migration plans for dist files and websites.

Pushed to resolve a few outstanding DNS migration issues for
the subversion, spamassassin, and libcloud domains.

26 Oct 2011 [Philip Gollucci / Jim]

Board Action Items:
Intervision LOC needs to be signed, filled out, and sent
approved/dell-2011-05.pdf needs to be moved to paid by treasurer@
approved/Noirin-Infra-flights.txt needs dealing with

No other items are expected to be changed in Staff SA contracts.
Need signature by an officer who is aware of the new

General Activity:
The harmonia bad root disk situation remains
unresolved.   We expect to purchase a replacement disk
with Uli's help and have Bart van der Schans install it

Bart has received and tested the replacement server
for aurora.  We need to schedule a date for
decommissioning and eventual reinstall of aurora soon.

Held an infra meetup to coincide with the Surge conference
in Baltimore, MD.  Discussed plans for the remainder of the
fiscal year. Tabled the staff review b/c the president could
not attend.  We'll reschedule.
See notes here --

Pursuant to explicit authorization from the board, started
the git hosting experiment with CouchDB as the initial guinea

Renewed the domain for 1 year.

Updated's operating system in light of the
published local-root vulnerability in FreeBSD.  Other FreeBSD
hosts are scheduled to be upgraded to FreeBSD 9.0 on release.

Began work rationalizing the state of our switches at OSUOSL.
The current situation is a mess, we're moving to a one-switch-
per-rack configuration with no cross-cabling between racks
other than through the patch panel.

Re-initiated the transfer of the domain to the ASF's
dotster account.

Renegotiated the financial terms of Gavin's and Joe's contracts.

Began talks for another infra meetup sometime during Apachecon NA 2011.

21 Sep 2011 [Philip Gollucci / Jim]

New Karma:


Payments to staff were about 1 week late this month.

Board Action Items:
Intervision LOC needs to be signed, filled out, and sent

approved/dell-2011-05.pdf needs to be moved to paid by treasurer@

General Activity:
Response time on new account requests remains under
2-3 days.

The harmonia bad disk situation remains unresolved
at this time.

Ordered a replacement host for aurora from a Dell
reseller in Germany. Cost = 5259.80 EU.  It is to be shipped
to Hippo for eventual installation by Bart van der Schans.

Daniel Shahaf improved our automated banning of abusive IP
addresses with respect to svn traffic.

Failed to successfully incorporate Terry Ellison of the (ooo) project into the infrastructure community.
Terry was working on migrating the existing wiki and forum
services for that project to ASF gear, but gave up after
being frustrated with his interactions with the ooo community
at the ASF and the infrastructure team in particular.  His
volunteer efforts will be missed.

Mark Thomas successfully migrated the ooo bugzilla instance
to ASF gear.

Mark Thomas also improved our svn traffic banning schemes.

Upgraded our relative state of paranoia following breakins to and

Lost a disk in hermes' (mail) zfs array which was subsequently
replaced with an existing spare in the rack.  We need to look into
purchasing another spare of the same specifications for future
disk failures, as there are none left for us to use in the rack.

17 Aug 2011 [Philip Gollucci / Jim]

Setup and racked all the gear for the backup solution at TRACI.
Philip flew down to assist, and we setup a safe deposit box to
store tapes offline.

Harmonia lost a root disk and reported errors with its zfs array.
The errors were subsequently cleared but we need to look into a
replacement root disk for the one we lost.

After more complaints about delays in the account creation process,
Sam Ruby created a script to automate the input processing for new
requests.  Together with more root people participating in the process,
this has significantly cut response times from several days to just
a few.

Upayavira continues to work on the selfserve infrastructure, which will
someday completely replace the existing account creation procedure.

Uli Stärk is finalizing the specs for the EU replacement host for
aurora.

Setup for the Openoffice MediaWiki install as well as the Openoffice
forum site has made significant progress.  Terry Ellison has led the
effort for both.

Andrew Bayer approached us with an offer to provide either a cash
donation or hardware donation for additional build slaves.  Serge
explained the situation with targeted donations and our past experiences
with hardware donations at the ASF.

Daniel Shahaf upgraded our svn installation to 1.6.17 on eris, thor,
and harmonia in order to prevent further loss of commit emails.

Cleaned up the root@ alias, adding the President Jim Jagielski to it.

Doubled our available RAM on sigyn (jails) in the hopes of improving
stability of the host.

Bumped the secretary's LDAP karma to a level at least on par with
a PMC Chair.

Niklas Gustavsson was granted infra karma for his work on our Jenkins
build infrastructure.

20 Jul 2011 [Philip Gollucci / Jim]

LDAP + TLS + LPK work is underway.

Justin Erenkrantz stepped down from root@.

Ordered 1y Warranty extension for selene and phoebe (Geronimo TCK builds).

Organizing an infra meetup to coincide with the Surge Conference in
Maryland next month.

Got listed in SORBS again.  Subsequently filled out the delistment forms
so all is well again.

Brought in-house to better deal with the spate of XSS

Started coordinating with existing sysadmins to plan for
eventual service migration.

Part of the backup system order has arrived in Florida.  The remainder is
due to arrive by the end of the week.

Arranged for travel from VA to FLL for Philip to help out with the racking
and safe deposit box for the backup system.

Work on upgrading to handle acct creation is progressing.

Uli Stärk specced a replacement host for aurora (eu websites).

AI: Sam to follow up on Philip's missing credit card

15 Jun 2011 [Philip Gollucci / Jim]

Upgraded MoinMoin wiki to 1.8.8.

Mark Thomas upgraded Jira to 4.3.4.

Niklas upgraded Jenkins to 1.413 and moved it to a more stable URL
(removing hudson from the URL).

Took another whack at cleaning up /dist dirs.  Current status is appended
to the end of the report.

VP visited OSUOSL to survey the racks.

Looking to purchase additional RAM for sigyn (which keeps panicking
because of lack of RAM).

More XSS vulns reported against, which we don't
physically host.

Ran into a snag while placing the order for the replacement backup
system.  At this point we need the treasurer's sign-off on a Bank &
Trades agreement with Intervision who will subsequently provide us
with Net-30 terms for the order.

OSUOSL has notified us they are running low on cooling capacity, which
may affect our ability to host new machines at their datacenter.

OSUOSL wants us to purchase a switch to rationalize their cabling
system for our 3rd rack.  We plan to accommodate them once the backup
system has been purchased.

We are purchasing a warranty extension contract from Dell for selene
and phoebe, our Geronimo TCK build hosts in Traci.


Status of PMCs/podlings that had not completed /dist clean-up by third

Ignored all three e-mails to the PMC:
   buildr, cassandra, click, couchdb, logging,
   spamassassin, synapse, tcl, tiles

 Incubator (Graduated):
   buildr, river

   deltacloud, manifoldcf, olio, vcl, whirr

 chemistry - partially complete, dotcmis 0.1 needs to be removed
 empire-db - partially complete, 2.0.7 needs to be removed
 ws - partially complete, axis-c 1.5.0 needs to be removed
 santuario - argued, did nothing
 openwebbeans - argued, did nothing

Fixed and confirmed on third reminder:
 abdera, bval, cocoon, esme, hbase, hive, libcloud, nutch, oodt, perl,
 portals, qpid, roller, shiro, thrift, uima, wink, xmlbeans,

Should not have been on third reminder list:

19 May 2011 [Philip Gollucci / Jim]

Migrated (aka minotaur) to HP gear.

Discussed changing our Dell rep to a reseller better suited
for our business.

Received Technologent invoice via OSUOSL.

Started the initial steps of transferring ownership of to the ASF.

Tony Stevenson was brought on board as our third contracted
(part-time) sysadmin.

Setup ACLs to allow PMC members to browse the archives of
their own private lists.

Setup infrastructure for storing PGP fingerprints in LDAP.

Discussed the addition of a benchmark-running host for projects
to use, particularly lucene.

Continued pursuing the offenders on the XSS list reported to us
by security@ last month.

Scheduled an infra meetup to happen at the Surge conference
at the end of September.

Investigating some uptime problems with sigyn (tlp zones).

Received a pair of D53J JBODs from Silicon Mechanics, with
a partial refund (directed to treasurer@) for delays and lower
performance drives.

Addressing management of dist/ trees as requested by Greg Stein

It was observed that some account creation requests were not done for over a month. Prior experience was that accounts were processed weekly.

Should not happen again.

20 Apr 2011 [Philip Gollucci / Jim]

Cancelled our incorrect order for warranty support with Dell.
The order was subsequently replaced with a lower-cost version ($1000),
covering the same equipment for a 1y term.

Sent a purchase order to Technologent for 1y warranty support on thor
(confluence) and gaea (zones).

Purchased and received a Dell r210 for $1700 for managing our VMWare
VSphere installation.

Corrected our contact information with Dell for the umpteenth time.

Had OSUOSL deal with the mis-shipment from HP.  Correct replacement parts
are expected to arrive soon.

Provided a little mod_rewrite magic to simplify our main webserver

Upgraded Confluence in advance of a public security notice.

Purchased a pair of Intel nics: 1 to go into the HP gear and 1 for

Tweaked the network config on eris (svn) and eos (websites) to alleviate
chronic downtime problems.

Made an aborted attempt to upgrade minotaur (people) to new gear.  Will
be repeating that process shortly, this time using our HP donation as
the target host.

Mohammad Nour was granted karma to help Paul Davis sort out git hosting.

Received a comprehensive XSS vuln report for several services we offer.
Still working through the list.

Offered Ryan Pan infra-interest karma for his penchant for running
vmstat on minotaur (people).

Renewed our service warranty with Dell for 2y covering hermes (mail) and
eris (svn). Price: $1200.

Sorted out the confusion regarding the iLO license for the HP gear.

was requested by a couple of PHP projects; it is now
live and ready for projects to put PEAR releases on.

Citing a lack of time, Philip, the current VP, asked for someone
to be appointed by Jim, president.  It seems unanimously agreed, at
this time, this is not in the best interest of the foundation or the
infrastructure team.  Instead, some of this role's responsibilities,
which are too much for a volunteer over the long run, are going to be
doled out to a new part time paid contractor and the current full-time
staff.  It's hoped this will reduce the workload of the role to about
5hrs/week. This is reasonable for an active volunteer.  We will
re-evaluate this after the new position is up and running.  Philip
is able to continue with the VP position for now with this reduced work

16 Mar 2011 [Philip Gollucci / Jim]

Upgraded ALL of our FreeBSD hosts to 8.2-RELEASE except for minotaur,
which is slated for replacement later this week.

Paul Davis was granted infra karma for his work on git hosting.

Reined in the use of Jira accounts associated with mailing lists,
predominantly for security reasons.

Submitted a budget to the budget committee for 2011.

The long-awaited gear from HP has arrived at OSUOSL.  We are in the
process of having it racked and brought online in our new 3rd rack.
(Unfortunately some of the drives are incompatible with the hosts,
so we are awaiting additional gear from HP to resolve this).

Migrated user .forward files into LDAP.  LDAP is now authoritative
for forwarding addresses.

Rationalized much of our vhost config for the tlp websites using

Initiated a general cleanup request to tlp's with large /dist/ directories.

OSUOSL has upgraded our bandwidth cap to 50 Mbps inbound and 100 Mbps
outbound (up from 10 and 50 respectively).

Our expected (and paid for) arrays from Silicon Mechanics are delayed
3 weeks pending arrival of the requested Hitachi drives.

Upgraded our svn servers to deal with announced DoS vulnerability.

Spoke with one of our Dell reps about pricing inconsistencies in our new
service contract.  It should be resolved in the near future (in our favor).

Arranged for an updated quote from Technologent for service on our remaining
Sun gear.

16 Feb 2011 [Philip Gollucci / Jim]

Renewed service contract with Dell for 1 year regarding baldr
and our 2 PowerVault arrays.

Received a VAT invoice for ~ 900 EU for the array we recently
shipped to SARA.  Awaiting a wire from the treasurer for payment.

Specced a pair of D53J JBODs from Silicon Mechanics.  Awaiting
a wire from the treasurer for payment.

Started discussing next year's budget.

We now have a 3rd rack to use courtesy of OSUOSL.

Reworked the script to overcome issues with opie.

Shut down the portals zone for security issues.

Deployed ckl- "CloudKick logging tool" to all our FreeBSD hosts.
See for details.

NERO, our network provider at OSUOSL, discovered an open HTTP proxy
on one of our hosts.  Upon further investigation we closed several
other outstanding security issues and removed root access from those
responsible for the poor setup.

Paul Davis continued his work on git hosting, setting up a mailer for

Daniel Shahaf was promoted to root@ for his outstanding work in several

Upgraded Jira to 4.2.x and added GreenHopper (supports agile
development) for all projects. We're also eating our own (fresher) dog
food since Jira now runs on the latest Tomcat 7 release.

19 Jan 2011 [Philip Gollucci / Jim]

Fixed all outstanding issues with the backup system.

Brought erebus online (one of the Dells purchased last month),
to serve as our main VSphere host.

Setup a test instance of JIRA 4 in preparation for the 3.x to 4.x

Instituted a password policy which locks accounts for 24 hours after
10 failed login attempts.  See

for details.
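
The lockout rule described above can be sketched as follows.  This is
an illustrative model only, not the mechanism infra actually deployed.
The class and method names are hypothetical, and it assumes that the
failure count is reset by a successful login and that the count starts
over once a lock has been applied, neither of which the report states.

```python
from datetime import datetime, timedelta

MAX_FAILURES = 10                    # from the report
LOCK_DURATION = timedelta(hours=24)  # from the report


class LockoutTracker:
    """In-memory model of 'lock for 24 hours after 10 failed logins'."""

    def __init__(self):
        self.failures = {}      # user -> failed attempts since last reset
        self.locked_until = {}  # user -> datetime at which the lock expires

    def is_locked(self, user, now):
        until = self.locked_until.get(user)
        return until is not None and now < until

    def record_failure(self, user, now):
        # Attempts made while the account is locked are not counted.
        if self.is_locked(user, now):
            return
        count = self.failures.get(user, 0) + 1
        if count >= MAX_FAILURES:
            self.locked_until[user] = now + LOCK_DURATION
            count = 0  # assumption: counting restarts after a lock
        self.failures[user] = count

    def record_success(self, user):
        # Assumption: a successful login clears the failure count.
        self.failures.pop(user, None)
```

A caller would check is_locked() before attempting authentication and
call record_failure() or record_success() with the outcome.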

Brought the CMS to a feature-complete 1.x state.  It is now ready
for wide-scale adoption, starting with the incubator; see

for details.

Updated our account details with Dell.

Dealt with extended outage during the New Year

Dealt with some wiki abuse reports from NERO regarding attachments.
As a result we have disabled the feature across the wiki farm.

Updated the LDAP scripts on to filter out redundant
entries in all "modify" operations.
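
The report does not say what counted as "redundant", so the following
is only a sketch under one plausible reading: an "add" of a value the
attribute already holds, or a "delete" of a value it does not hold.
The function name and the (op, attribute, value) tuple format are
hypothetical and do not reflect the actual scripts.

```python
def filter_redundant_mods(current, mods):
    """Drop add/delete operations that would not change the entry.

    current: dict mapping attribute name -> set of existing values
    mods:    list of (op, attribute, value), op is "add" or "delete"
    Returns the operations that are still needed, in order.
    """
    # Work on a copy so the caller's view of the entry is untouched.
    state = {attr: set(vals) for attr, vals in current.items()}
    needed = []
    for op, attr, value in mods:
        values = state.setdefault(attr, set())
        if op == "add":
            if value not in values:
                values.add(value)
                needed.append((op, attr, value))
        elif op == "delete":
            if value in values:
                values.remove(value)
                needed.append((op, attr, value))
        else:
            raise ValueError("unsupported operation: %r" % (op,))
    return needed
```

Only the operations returned by the function would then be sent to the
directory server.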

RMA'd a failed drive back to Silicon Mechanics.

Promoted Daniel Shahaf to enjoy root karma on minotaur (people).

Daniel Shahaf setup our reverse ip zone master for our OSUOSL ip's
with OSUOSL's dns server slaving off that.

Specced a new JBOD array for service at about $7K.

Brought online (props to Ian Boston, Daniel Shahaf, and
Tony Stevenson); see

for details.

Confluence upgraded to the latest 3.x version, courtesy of Gavin McDonald.

15 Dec 2010 [Philip Gollucci / Jim]

Ordered a pair of Dell r510's for $24.2K, slated to be used
for jails and database hosting.

Discussed JVM settings on the ofbiz vmware instance with the
ofbiz PMC.

Discussed previously offered HP donation of a pair of servers:
one for and one for jails hosting.  Agreed to accept
the donation but will not seek to accept hardware donations
in the future.

Had Y! replace a pair of failed disks in minerva (builds). As it's
a RAID 5 array we had to reinstall the OS.

Patched our JIRA installation in response to the recent security
advisory from Atlassian.

Gavin McDonald did an analysis of our backups to date, pointing
out specific backups that are not going thru cleanly.

Bruce Snyder successfully arranged for a VSphere license from VMware.

Discussed the fact that Justin Erenkrantz has moved away from the UCI
campus where our backup server is hosted.  The issue needs further
attention going forward as we need someone local who can change out
the tapes and put the used ones into a safe deposit box.

Paul Davis made significant progress in our pursuit of read-write
git hosting.

It was noted that several projects gave thanks to infra for their work this period. Well done!

17 Nov 2010 [Philip Gollucci / Jim]

We specced and ordered an Opteron-based Dell R515 machine
for virtual machine hosting- cost ~ $12.5K.  A FC card was
thrown in (for reuse of the decommissioned helios array)
at no additional cost.

Chris Rhodes was given root@ karma.

We held a team meeting on Monday Nov 1 during ApacheCon.
Highlights were posted to infrastructure-private@.

Daniel Shahaf was made a member of the Infrastructure Team.

was converted to the CMS with new templates
and stylesheets provided by Chris J. Davis.

HP has offered to supply our replacement servers for SARA.
Tony Stevenson is leading this conversation, as it could save
us roughly ~$30K from our budget.

Discussed the fact that Apple has EOL'd Xserves while we just
recently racked a donated pair of them.

Renewed the service warranty with Dell on baldr (jails) for 1y.

Plan drafted on what it would take for Git to be able to be
used as a project's primary source code repository. Currently
dependent on volunteers driving the effort.

Shane: yay new web site!

20 Oct 2010 [Philip Gollucci / Jim]

Helios (zones) has been decommissioned, replaced by FreeBSD jails
on sigyn.

5 new TLPs were processed: pig, hive, shiro, juddi, karaf.

We still need to purchase an Opteron-based box per the sponsorship
agreement with AMD.

Tony Stevenson fixed our busted zfs-snapshot-to-tape script on bia

Upgraded all our Linux hosts to deal with the announced local root
exploit bug.

Bruce Snyder continues to pursue a VMWare VSphere installation for

Coding work on the CMS has completed: for details see

We are planning to go "live" during Apachecon once the new templates
and stylesheets are available for

We received a DMCA takedown notice regarding some content in the
logging and Hadoop wikis.  The PMCs have been notified but have not
reported back on their progress.

Specced an FC card to allow us to reuse the helios array, providing
sigyn with more storage space.  Estimated cost ~ $1000.

The AMD box has not yet been purchased.

AI: Jim to facilitate the purchase of the box.

22 Sep 2010 [Philip Gollucci / Jim]

Gavin McDonald is in the final phase of migrating all necessary Solaris
zones from helios to FreeBSD jails.

Hudson master has moved to a new machine (aegis) and begun using LDAP,
thanks to Tony Stevenson and Niklas Gustavsson.  Jukka Zitting notes
it may be time to start experimenting with running Hudson slaves on EC2.

Sander Temme is preparing eve (Xserve) for hudson and buildbot usage.

Daniel Shahaf patched the downloads script to deal with a potential
XSS vulnerability.  Unfortunately some stray code wound up in production
due to anakia deployment issues, which took the downloads script down
for several hours.

Sander Striker signed a Dell "letter of liability" for Dutch equipment
purchases for SARA.

Stefan Bodewig was given full infra karma due to his vmgump contributions.

Ulrich Staerk was given full infra karma based on his work on,
jira, and confluence.

We received 2 disks from Silicon Mechanics and replaced a failed disk in
eos(www).  The replaced disk will be shipped back to Silicon Mechanics
under RMA.

We ordered a disk array from Silicon Mechanics to be drop-shipped to Bart
van der Schans in the Netherlands for eventual deployment at SARA.  Cost
was ~$5700.

Started up a project for creating a custom CMS for Apache.  Initially it
will target, with something for people to review around
Apachecon in November.

Gavin McDonald proposed some new equipment purchases to build out our
VM infrastructure.

Upayavira completed the domain transfer for

Ari Maniatis is pursuing hosting an svn mirror in Australia.

Don Brown is pursuing the idea of getting support
for the Confluence auto-export plugin.  Go Don!

18 Aug 2010 [Philip Gollucci / Justin]

Daniel Shahaf and Dave Johnson and Niklas Gustavsson were granted
infra-interest karma.

Norman Maurer brought new athena online.

Sander Temme shipped the pair of Xserves in his possession to OSUOSL.
They have been racked and are currently being configured for use.

Ari Maniatis has been in touch with the University of Sydney for
the purpose of both hosting a svn mirror and to provide facilities
for an Apache conference/barcamp.

We looked into the idea of holding an infra-thon before Apachecon
in November but the timing didn't seem to work out.  We will probably
try again in early 2011.

Began the process of speccing replacement hosts for our EU gear hosted
in SARA.

The replacement host for minotaur has arrived, been racked in OSUOSL,
and is currently being setup by Philip Gollucci.  Note that we are running
out of current capacity (amps) in our racks.

Gavin McDonald reached out to pmcs with zones on helios to tell them about
our migration plans: replacing those zones with FreeBSD jails.  Most have
responded in a timely manner.

Odin has been decommissioned and the vmware instances it hosted for vmbuild
and vmgump have been transferred to nyx.

One of our disks in the brand new eos array was defective.  It's been
taken out of service and is to be shipped back to Silicon Mechanics for
replacement.  We have also ordered a pair of spare drives from them
as well.

We ordered an array from Silicon Mechanics to be shipped to Bart van der
Schans in the Netherlands for ~$6000.  The array is to be part of our
replacement plan for aurora in SARA.

We've moved the Hudson master to a new machine (aegis) and begun
using LDAP for Hudson, thanks to Tony Stevenson and others.

Shane asks when amps shortage will be critical. Philip replies that it's under control.

Approved by general consent.

21 Jul 2010 [Philip Gollucci / Justin]

Philip Gollucci performed a general cleanup of the filesystem
on minotaur (people).

Ok'd Jukka's plan to setup Gerrit for hosting an Apache Lab.

Infra was notified of a compromised gmail account potentially
hacked as a result of the jira hack on Apache.

Joe Schaefer was called for 2 weeks of jury duty and the team
capably picked up the slack in his absence.  Kudos in particular
to Gavin McDonald, Philip Gollucci, and Paul Querna.

The new replacement machine for eos is currently online and
serving traffic for and
Migration of the moin wiki will be forthcoming soon.  Thanks to
Paul Querna and Philip Gollucci for doing the bulk of the setup.

The new replacement machine for athena is set up and should be
brought online shortly.  Thanks to Norman Maurer and
Philip Gollucci for doing the bulk of the setup.

Received a "paid invoice" notice from Network Depot for the

Held some discussions between Mark Thomas and Philip Gollucci
about how to set-up baldr, the machine destined to host

Ordered a Dell R410 for ~$5000 to serve as a replacement for
minotaur (people).

Tony Stevenson made a few modifications to our LDAP tree to
better service Hudson and similar apps.

16 Jun 2010 [Philip Gollucci / Justin]

Sander Temme has been busy configuring the pair of Apple-
donated Xserves prior to shipment and installation at OSUOSL.

Gavin McDonald did some repair work on our backup systems.

Discussed the future of odin (vmware) while Mark Thomas
replaced a failed disk in it.

Mark Thomas was promoted to root@.

Brought a URI-shortening service online:
which takes advantage of LDAP + the Thawte-supplied wildcard cert.

No progress was made in bringing up the purchased replacement
host for eos.

The new machine Aegis was brought into service to replace Ceres as
the Buildbot Master; Ceres continues to be a Buildbot Slave.
Work is underway to move Hudson.zones Master to Aegis also.

Sebastian Bazley has taken up the charge to address a few key
crons that require access to private svn urls.

19 May 2010

Replaced an S300 software raid card with a Perc 6i card in
aegis (builds).

Enforced a hard May 1, 2010 deadline for all admins to adopt
OPIE on all Linux and FreeBSD hosts at Apache.

Gavin McDonald upgraded Confluence to 3.2 - the latest available
version.  Dan Kulp was kind enough to patch the autoexport plugin
this time round, but we need to take another serious look at phasing
out confluence as a CMS, perhaps replacing it with Day's CQ5, in the
near future.

Initiated periodic password cracking program for our most sensitive
passwords, particularly LDAP passwords.  Early results identified
some 60 accounts vulnerable to dictionary-style attacks, and those
users have been contacted.  Also notable was the identification of
FreeBSD crypt as being a superior storage format for hashed passwords
as opposed to SSHA, so we are in the process of phasing out SSHA
for LDAP passwords.

We are in the process of compiling a list of accounts with security
issues that are no longer reachable via their email address.
Those will be the first group of accounts we close out.

We have cleaned up the root@ alias addresses and synced them with
committee-info.txt.  Notable changes were the removal of Roy Fielding,
Ted Husted, Joshua Slive, and Erik Abele, and the addition of Gavin
McDonald, Tony Stevenson, and Norman Maurer.

Due to port restrictions and lack of console access to Y! machines, the
Buildbot master was moved to the newly brought online 'aegis' builds
machine. The old host 'ceres' remains as a buildbot slave. Hudson master
is due to move across from Hudson.zones shortly.

Noirin Shirley was voted onto the Infrastructure Team for her editorial
work on the infra blog.

21 Apr 2010 [Philip Gollucci / Justin]

Norman Maurer upgraded several Solaris machines to the latest
available version.

Purchased 48GB RAM from Crucial slated for installation in
the upcoming eos (websites) replacement host.

Had OSUOSL ship Ken Coar the failed fireswamp (apachecon) host.

Installed new Dell 48 port switches and moved all public dracs
at OSUOSL to the private network.

Ordered a Dell R410 (slated to replace eos) with external SAS
card for $3800.

Ordered a 12-disk JBOD from Silicon Mechanics for $6400.

Ordered a Dell R210 (slated to replace athena) for $2200.

Ordered a SonicWall SRA 2000 for $2300 to replace crappy VPN device.

Took Paul Querna up on his offer for Cloudkick service for
host/service stats.

Ruediger Pluem was granted full infrastructure karma.

Working out details of an Apple donation of a pair of Xserves.

Brutus (issues, cwiki) got hacked.  The details are available

17 Mar 2010 [Philip Gollucci / Justin]

Philip Gollucci signed the annual service contract with Sun/Technologent
for ~$2K.

2 SSD's installed into eris(svn) to boost performance.  Between our EU and
US svn servers we currently handle over 6M hits / day.

RAM ($1200) installed into eris(svn) and brutus(jira,bugzilla) to boost

Website traffic to our tlp's and is hovering around 10M hits a day.

Spam traffic continues to fall: we are currently seeing only about 600K
connections per day, down from its peak of 1.5 M connections a day in 2006.

Philip Gollucci worked some magic and has upgraded all of our FreeBSD boxes
to 8.0-stable.  The old NGROUPS_MAX problem that previously limited users to
16 unix groups is a thing of the past.

Discussed and created a budget for FY 2010 worth ~$250K.

In discussions to purchase an Xserve from Apple for ~$6K.

Discussed an offer from a third party to host a virtual machine for us.
Ultimately the offer was declined.

Discussed plans for migrating Solaris zones to FreeBSD jails.

Aurora (websites) is down for an extended period of time until we can determine
whether or not to replace it immediately or have the machine serviced by a
Sun tech.

Gavin McDonald specced another dell for use as a build farm server for ~$6K.

Purchased a pair of Dell 5448 48-port managed switches for ~$1600.

Brad Davis of FreeBSD infrastructure subscribed to infra-private@.

Aristedes Maniatis was granted infrastructure-interest karma.

Geir asked about Technologent invoice, Phil to follow up.

SVN performance improvements are much appreciated!

17 Feb 2010 [Philip Gollucci / Justin]

Turns out the SAS card we ordered for eris (svn) is totally unsupported
by FreeBSD.  We have sent it back to Newegg and ordered a known-to-be-
supported Dell card instead, with additional cables from Provantage,
at a total cost of ~$350.

After several months of trouble with the nyx (vm's) raid card / array,
upgrading "everything" seems to have resolved the random disk failures.

Jukka Zitting has installed gerrit on for preliminary
(semi-authorized) testing of native server-side git support.

Mass mailed hundreds of committers about insecure / oversized items in their
home directory.  It is not encouraging to see the same people on the lists

Philip Gollucci negotiated new terms for the Sun support contract (to be
signed later this month).

Tony Stevenson and Chris Rhodes engineered the successful migration of Unix
and svn group data into LDAP.

Board agrees with the approach of temporarily locking out committers who aren't paying attention to security notices.

20 Jan 2010 [Philip Gollucci / Justin]

De-racked 4 decommissioned machines: freyr, freyja, idunn, fireswamp.

Introduced ckl, a communication tool for distributed sysadmin teams,
into our workflow.

Tony Stevenson acquired a 2 year wildcard SSL certificate from Thawte.
All top-level project websites, including, are now
available over https.

Sketched out some preliminary plans for migrating the zones on helios
to jails on sigyn.

Replaced a failed disk in hermes(mail).

Ordered a pair of SSD's and a JBOD enclosure to house them from Silicon
Mechanics for ~$2400.  The order is expected to ship before Feb 1 and
will be used to beef up performance on eris (svn).

Ordered a corresponding PCIe 2xSAS card for eris (svn) to communicate with
the SSD's for ~$400.

Dan Kulp was granted root on brutus to assist with confluence admin.

16 Dec 2009 [Philip Gollucci / Justin]

Sander Striker ordered a disk replacement for nike ( and Bart
van der Schans installed it.

Began testing a searchable interface for private mail archives courtesy
of Chris Rhodes.

Started sending out periodic notifications to users with oversized home
directories, insecure permissions or private keys.

Improved our DNS configuration with input from Surfnet.

Ordered 6 SCSI drives to serve as replacements for various failed drives.

Tony Stevenson specced an HP machine slated to replace the aging minotaur

Tony Stevenson expanded our LDAP footprint to now be usable for logins
with all subversion repositories.

In contact with Sun to replace a failed controller in helios's array.

Philip Gollucci upgraded loki to FreeBSD 8.0.

Philip Gollucci patched FreeBSD on minotaur(people) to deal with a few
security advisories.

Martin Cooper and Davanum Srinivas dealt with a mass influx of spam
accounts into Confluence.

18 Nov 2009 [Philip M. Gollucci / Justin]

Philip Gollucci was appointed VP Infra, replacing Paul Querna.

Brian Fox was added to the Infrastructure Team.

A podling's distribution directory was compromised due to lax
permissions (our hacker from August had installed a backdoor
script in a user's public_html directory).  The offending
material was removed prior to being distributed to the mirrors.
A general cleanup of home-dirs with lax permissions was also

One of our virtual hosts was hacked, most likely due to a poor
choice of root passwords (and leaving remote-root-logins enabled).
The virtual host was summarily nuked as a result.

There was some confusion surrounding the creation of a DNS entry
for the upcoming Asia Roadshow event.

A general question was raised by Philip regarding how much Infra
has spent of its budget so far.  Advice from the Treasurer would be helpful.

Sander Striker is pursuing a replacement for clarus.

Infra held a face-to-face meeting at ApacheCon.  Notes were posted
to the infra-private list by Upayavira.

Tony Stevenson has started tackling LDAP again.

Chris Rhodes has been testing a service to provide web access to our private
archives to members.

The PDFBox TLP was created.

Subversion's repo has been migrated to the ASF repo.

Chris J. Davis began hacking up a new website for Apache.

Gavin McDonald continued to extend our buildbot offerings.

We seem to have unresolved issues with the tape library on bia (backups).

We're in discussions with HP regarding a hardware donation/loan.

We're in discussions with Thawte regarding SSL certificates.

Two disks have failed in nyx's (virtual machines) array.  Tony Stevenson
has partially addressed the situation, but we need to purchase replacement
drives ASAP.

One disk has failed in eos's (websites, wiki) array.  We're waiting for
a replacement drive to be purchased.

We're stalled on what to do about switch replacements.  The current (24 port)
switches are essentially full, and we need to purchase a switch or two with
more ports to replace them.

21 Oct 2009 [Paul Querna / Justin]

Moin wiki post-upgrade issues resolved.

Ruediger Pluem developed a patch for httpd to mitigate the ddos
issue plaguing brutus (jira,bugzilla).

Norman Maurer upgraded eos (wiki,websites) to Solaris 10 u8.

Issues with the sync scripts were resolved by Tony Stevenson.

Athena ( was brought back online by Philip Gollucci after
osuosl replaced the power supply.

Philip Gollucci developed an automated system for managing crons.

Paul Querna purchased a Dell Poweredge R610 for ~$4400 slated
as a replacement for helios (zones).

Justin Erenkrantz dealt with some IO issues on helios by shuffling
a few zones around.

Philip Gollucci cleaned up a few unused root accounts.

(Temporarily) suspended a committer's privileges pursuant to a board

General interest was expressed in participating in a cross-foundation
infrastructure list hosted by osuosl.

Gavin McDonald continued his work on the buildbot-based farm.

Gavin McDonald spec'ed a Dell machine to potentially be used as an
Australian svn mirror.

Philip Gollucci upgraded viewvc to 1.1.2.

Paul Querna ordered a fiber channel card for ~$450 to complement the
ordered Dell mentioned above.

23 Sep 2009 [Paul Querna / Justin]

 o Hacking Incident
 = First major security incident on our infrastructure since 2001[1].
There are always possible things to change, but we handled it well,
and have rebounded with one of the most active months in recent Infra
 = Initial Report:
 = Full Report:


o SARA/SURFnet hardware moving to new location inside same data
center. bvds heading up the local team on the ground.

o Added Gavin as a part time contractor.

o SvnPubSub developed
 = Notifies services of changes to the Subversion Repositories
 - Twitter Bot Online <>
 - Testing SvnWcSub, to keep a working copy in sync with a master
repository.  Will replace /dist/ and most websites distribution in the
long run, which is currently being done with rsync over SSH.

o mod_allowmethods developed
 = Disabled all non-GET requests on most VHosts for *

o mod_asf_mirrorcgi developed
 = Hack to map our hundreds of identical download.cgi scripts to
invoke the same CGI directly.

o Disabled CGI support on most vhosts for *

o MoinMoin wiki upgraded from 1.3.x to 1.8.x

o FastCGI via mod_fcgid is now used for the wiki and mirrors.cgi

o Nyx setup with VMWare to host various VMs.

o Enabled OPIE on Brutus & Nyx.

o Dealt with ZFS issues on minotaur(people). Had to rebuild the array
after a disk died.

o Replaced with for one of our Slave DNS Servers.

o Coordinated a security fix for Bugzilla.  We were contacted ahead of
time by Bugzilla developers, and given a patch to apply before they
made a public disclosure.  Mark has since updated us to their new

o Requested new Solaris 10 OS subscription keys from contact at Sun.

o thor brought online (to host svn-dist and search services)

o eos, bia, thor, and aurora upgraded to Solaris 10 u7.

o New sshd_config using SSH keys stored in SVN for infra members has
been deployed on most machines.

o In the process of removing unneeded access & sudo to several machines
(hermes, brutus, minotaur)

o promoted norman,pctony,gmcdonald to root@ on all fbsd boxes

o zfs is declared production ready in 8.0-RELEASE when it comes out

o Minotaur(people)
 = Updated to 7-stable
 = Updated ports:
  - hpn-ssh - is now a port
 = Updated httpd 2.2.11 -> 2.2.13
 = converted to dns/bind96
 = setup no-ip as dns slaves
 = started ipfw->pf conversion

o hermes(smtp)
 = Updated to 7-stable

o hercules(
 = Updated to 7-stable
 = setup

o eris(
 = Updated to 7-stable
 = updated ports
  - serf is now a port
 = updated svn 1.6.1 -> 1.6.5
 = updated httpd 2.2.11 -> 2.2.13
 = updated svnmailer
 = attempted viewvc update
  - fixed viewvc file contents bug

o harmonia(
 = Updated to 7-stable
 = updated ports
  - serf is now a port
 = updated svn 1.6.1 -> 1.6.5
 = updated httpd 2.2.11 -> 2.2.13
 = updated svnmailer
 = attempted viewvc update
  - fixed viewvc file contents bug

o athena(
 = Updated to 7-stable
 = replaced dead disk ad4 [osuosl]
 = replaced doa disk again [osuosl]
 = updated httpd 2.2.11 -> 2.2.13
 = updated ports

o nike(
 = Updated to 7-stable
 = updated httpd 2.2.11 -> 2.2.13
 = updated ports
 = updated lom [osuosl]

o loki(tb,ftp,cold spare)
 = Updated to 7-stable
 = updated ports

19 Aug 2009 [Paul Querna / Justin]

Sander Striker is still looking into purchasing a pair of replacement
drives for aurora (.eu website mirror).

Installed the recently purchased r710 Dell box as nyx. Tony Stevenson
set up VMware ESXi on nyx and created a few virtual hosts, one XP based.

We're looking to return the four IBM x345's on loan and awaiting
feedback from IBM (contact point Sam) on where the machines should
be shipped.

Fireswamp's (apachecon) power supply failed.  The machine is too
old to try and service with replacement parts; we will create a
VMware host on nyx, restored from fireswamp backups, as a replacement
soonish.  In the meantime we're redirecting all
pages to

Attempts to round up volunteers for a softball game against the board
haven't met with much success, largely due to the short (and as yet
unknown) timeframe of the board retreat.  We may need to reschedule
the game to coincide with some other Apache related get-together sometime
next year.

Dealt with some unexpected/unplanned networking issues on our build
servers in the Yahoo! data center.

Philip Gollucci replaced the over-capacity gmirror-based /x1 array on with a shiny new raidz2 zfs-based array.  After a week
or so of random zfs failures, things seem to have stabilized since upgrading
the host (minotaur) to 7.2-STABLE.

Tony Stevenson sent out another request to PMC chairs to ensure the
asf-authorization file is compatible with our plans to migrate group
data to LDAP (aka phase 2 of the LDAP migration).  Compliance is still
a mixed bag, and we will likely just do the deed for the non-compliant

In the early discussion phase for a potential new Microsoft-based build

Gavin McDonald notes Buildbot can now automatically deploy snapshots to via nexus or to a custom download page.  (See as a working example.)
Reminder that Buildbot is geared up to produce RAT reports for any project
that wants it.

Paul Querna notes problems with the Confluence auto-export referencing
Javascript and CSS hosted on brutus caused some downtime for other services
on brutus (JIRA, Bugzilla, etc).

15 Jul 2009 [Paul Querna / Justin]

Justin Mason arranged for a free license to the Spamhaus DNSBL

Tony Stevenson and Chris Rhodes began testing phase 2 of LDAP
service on harmonia (svn mirror).

We lost another drive in aurora (.eu website mirror).  Sander
Striker is investigating a pair of replacement drives.

Gavin McDonald was voted into the root@ club for his interest
in FreeBSD maintenance.

Purchased an r710 Dell box for VMWare hosting for ~$5K.

Purchased RAM, a SCSI card, and additional drives from NewEgg
for ~$1100.

Mads Toftum upgraded aurora (.eu website mirror) to Solaris 10 5/09.

Don Brown upgraded our Confluence installation to 2.10.3.

Gavin McDonald is organizing the return of the four IBM x345s
on loan.

We continue to experience availability issues with eris (svn)
due to zfs issues with FreeBSD.

VMWare Workstation on odin was upgraded to 6.5.

17 Jun 2009 [Paul Querna / Justin]

Tony Stevenson completed phase 1 of the LDAP migration, migrating user
accounts on into LDAP.

Sander Striker promised to someday order a replacement disk for aurora
(websites) and have it shipped to Bart van der Schans in the Netherlands.

The SAS cable we RMA'd back to Provantage was returned back to us as an
invalid RMA.  We have procured a UPS shipping label from Provantage and
are attempting to resend it.

Infrastructure has made a request to PMC chairs to help us with Phase 2 of
the LDAP migration: bringing groups into LDAP.  The majority have complied,
while a large number of PMC's have yet to do so.

IPv6 support was disabled until we are better positioned to be able to
monitor and maintain it.

Henk Penning continued to keep a careful eye on the mirroring system.

Brian Fox continued his support for the Nexus installation at

Mark Thomas upgraded our Bugzilla instances to the latest version.

Chris Rhodes was voted in as a new Infrastructure committer.

Gavin McDonald continued to enhance our buildbot service at

20 May 2009 [Paul Querna / Justin]

Shipped errant SAS cable back to Provantage.

16 new backup tapes ordered, charged to ASF credit card.

Began organizing volunteers for moving our gear in SARA. We still
need to purchase a disk replacement for aurora (websites).

Philip Gollucci upgraded all of our FreeBSD servers to 7.2-RELEASE,
based on a central machine,, for compiling and pushing
out the software.

Paul Querna upgraded subversion to 1.6, splitting out the remaining
private portions of the "asf" repository into a new "infra" repository.
Note to officers: the new location containing the asf-authorization file is

Sent an opt-out letter for Phorm scanning on and related domains.

Discussed the lack of progress with respect to upgrading the Confluence
auto-export plugin to be compatible with recent releases of Confluence.
Adaptavist again claims to be nearly finished with the work (ETA 2 months),
but if the situation hasn't been resolved in that timeframe we will need to
pursue other options, including migrating all of to the
latest version of moin-moin.

Updated the Release FAQ, with input from several contributors.

15 Apr 2009 [Paul Querna / Justin]

Submitted a budget to the budget committee.

Purchased a 1-year extension to the Technologent service contract for
our Sun gear at $4K.

Set up a blogging infrastructure for projects to use at

Henri Yandell merged the Click, Cayenne, and Roller jiras into the main

Our Dotster account was hacked (again), the impact of which was to see
our DNS glue records for changed for a brief period.  Discussions
to change our registrar are ongoing.

Migrated our core mail server (hermes) from our last IBM x345 in service
to one of our new Dell 2950's.  The old gear will be deracked and sent back
to IBM along with the other x345's.

Set up Geographic DNS for to distribute traffic between
our master server (eris) and our European mirror (harmonia).

Purchased a Linksys RV042 VPN device for $138.

Work on the new continues, principally being performed
by Jukka Zitting and Grzegorz Kossakowski.

Work on LDAP at the ASF continues, being driven by Tony Stevenson.

Work on the Buildbot installation continues, being driven by Gavin McDonald.
We've set up a new mailing list for build services at

Norman Maurer upgraded our backup server (bia) to Solaris 10u6.

Lost a disk in both aurora (websites) and minotaur (people).

18 Mar 2009 [Paul Querna / Justin]

Replaced a failed disk in eris (svn).

Discussed what to do about disused accounts on
No actions taken at this time.

Replaced a failed disk in eos (websites).  Norman Maurer rebuilt
the array to be based on raidz2 instead of raidz1, and upgraded
the operating system to Solaris 10u6 + patches.

Gavin McDonald convinced Yahoo! to open up the buildbot server
port on ceres (builds).

Discussed a budget - the general consensus is that the infra
budget should be $150K-$160K per year, but folks on the budget
committee are looking for a more detailed breakdown of the
hardware portion.

Discussed an infra meetup between various open source
organizations, with the intent to fund travel should it come to

Norman Maurer convinced the SpamAssassin PMC to move their IO-
intensive jobs to their zone on odyne.

Shipped the errant SAS cable back to was setup on odyne for Jukka Zitting to work

Coordinated the move of to,
with the help of several member volunteers.

Henri Yandell carefully upgraded all of our Jira installations to

Sebastian Bazley continued his work validating foundation records.

Henk Penning continued his work handling mirror requests.

Justin confirms that the $150-160k figure includes staffing

18 Feb 2009 [Paul Querna / Justin]

Philip Gollucci successfully upgraded to
FreeBSD 7.1.

Paul Querna purchased cables, SAS card, and an array for
replacing the existing array on, for ~ $4500.
Unfortunately the cable provider sent us the wrong cable, so
we had to order a replacement.

Paul Querna and Sander Striker coordinated with our datacenter
providers to enable IPv6 routing for our websites.

Tony Stevenson set up backup services for the Apachecon hosts.

Paul Querna restructured the DNS zone to be generated from
a script.

Gavin McDonald continues to work on a buildbot installation on 2
of our new Yahoo! hosts.

Gavin McDonald pushes for a budget for infrastructure by the end of

Brian Fox set up a Nexus instance on the
zone to facilitate moving some of our maven infrastructure to

Paul Querna renewed the DNS for 2 years.

Confluence is still mired in the futility of hoping someone will fix
the autoexport plugin, despite many promises from Atlassian.

Mads Toftum and Norman Maurer are discussing how best to provision
thor for the services we will place on it.

Eos (websites) lost a ZFS disk which took it out of commission for a
few days.  Services were moved to aurora to prevent a significant

Infrabot was significantly enhanced, adding many new collaboration
features.  It is worth noting that infrabot is now on twitter at , which we expect to use for service
announcements and outreach to folks too busy to follow the
infrastructure mailing lists.

21 Jan 2009 [Paul Querna / Justin]

Discussions about what to do with thor now that the new
disks are installed continue.

The new Yahoo! machines are online and being configured
by Nigel Daley and Gavin McDonald with build services.

Henri Yandell continues to wrestle with a Jira upgrade.

Purchased a new array, cables, and SAS card for minotaur
(aka for $4700.

Brought loki (hot-spare) online with FreeBSD 7.1.  We're
planning to do the same for new hermes (mail) next month.

Tony Stevenson continues to lead the LDAP deployment discussions
on infrastructure-dev@.

Two TLP migrations, buildr and camel, were completed.

Roughly 2 dozen new accounts were created.

Philip Gollucci is planning to upgrade minotaur (
to FreeBSD 7.1 in preparation for the arrival of the aforementioned

Paul clarified that the original intent for Thor was as a build server, but the new build machines from Y! turned out to be a better match due to bandwidth constraints and location.

17 Dec 2008 [Paul Querna / Justin]

6 SAS disks were purchased for use in thor: cost $2000.
Discussions for repurposing thor as a database/blog server
once the disks are installed are ongoing.

4 Yahoo! build-farm machines have been made available to the
infrastructure team for configuration. Discussions for
requesting another hired sysadmin on a part-time basis
to provide centralized build services are underway.

Henri Yandell continues to wrestle with a jira upgrade.

Bart van der Schans visited the SARA colo again to help
with maintenance on harmonia (svn mirror) and nike (mail).

Paul Querna continues work on the new pages.

Sander Striker and Paul Querna are pursuing an IPv6 allocation
for apache hosts.

Adaptavist has offered to continue James Dumay's work on the
autoexport plugin (which still prevents us from upgrading Confluence).

Atlassian has offered to host an SVN mirror for providing more
Fisheye sites.

Renewed the domain for an additional 2 years.

Philip Gollucci somehow figured out how to upgrade our two frontline
mailservers, nike and athena, to FreeBSD-7-STABLE.

Renewed the SSL certificates for and

Upgraded all ~800 apache mailing lists to support the emerging SRS
and BATV specifications.

Tony Stevenson continues to pursue LDAP deployment at the ASF.

Chris J. Davis was brought on as a new infrastructure committer.

Jukka Zitting was added to the infrastructure team for his work
on git mirrors.

The attic pmc infrastructure was set up.

Three TLP migrations - abdera, qpid, and couchdb - were completed.

We discussed the importance of establishing a budget before we can evaluate the request for a sysadmin.

19 Nov 2008 [Paul Querna / Justin]

Yahoo! talks about a build farm are proceeding at our normal slow and steady

T-shirts were distributed to team members.

Held a face-to-face meeting at Apachecon with irc logging.

Henk Penning and Gavin McDonald led the effort to clean out the mirror system,
successfully purging over 6GB of redundant artifacts.

Made progress with bringing Confluence up-to-date, by first recognizing its
importance to current operations and then working with James Dumay on irc to
bring the autoexporter plugin up to date.  We expect to make more progress
once James' patch has been tested and incorporated into the autoexporter tree.

No progress has been made in replacing hermes (mail). We're looking forward to
tackling that when FreeBSD releases 7.1, with expected improvements in its zfs

Norman Maurer is still waiting for us to order disk replacements for thor

Took a few failed cracks at upgrading nike (mail) to FreeBSD 7 with Bart van
der Schans' on-site help.  Will continue tackling this problem before pursuing
FreeBSD upgrades to other x2200's in service.

Offered Chris Davis committership under infrastructure's purview. He has
accepted the offer.

Tony Stevenson-led investigations into deploying LDAP at the ASF have gained
some speed.  We've talked with the Apache Directory project a bit about
potentially using their software for this purpose.

Paul Querna provided the Syracuse researchers with a filtered dump of our
public svn tree, and will be coordinating with them to determine the best
way of keeping their soon-to-be-published copy of our repo current. We also
pointed them at the rsync location of our raw mail archives.

Justin Erenkrantz purchased a few new SSL certs to replace those that were
scheduled to expire next month.

Paul notes that we may need another part time administrator next year. Don't feel comfortable tasking that to our existing administrators as this requires root.

The infrastructure team is looking into Covalent's tools as time permits.

15 Oct 2008 [Paul Querna / Justin]

Yahoo! talks about a build farm are ongoing.

No progress has been made in replacing hermes (mail).
We're looking forward to tackling that when FreeBSD
releases 7.1, with expected improvements in its zfs support.

Fail2Ban, a tool for guarding against ssh scans, has been
implemented on by Tony Stevenson.

Some issues with Confluence, particularly its license,
have arisen. We are currently stuck between a rock
and a hard place in that we cannot upgrade it without
breaking the autoexport plugin, which is a core feature
of the service.

Disk replacements for thor (builds) have not been ordered,
pending a callback from CDW.

Mark Thomas was granted infra karma for his work on Bugzilla.

Discussions about creating a vmware instance for Windows
are ongoing.

Our svn servers are now servicing over 2M requests per day,
which is a doubling of activity over the past year.

Our automated checks against wiki spam have blocked roughly
100K attempts over the past year.

Jukka Zitting continues his impressive work experimenting
with git at Apache on infrastructure-dev@.

17 Sep 2008 [Paul Querna / Justin]

We've fallen a bit behind our machine upgrade schedule due mainly to
persistent concerns over the stability of FreeBSD on eris (svn).  The
machines that need to be brought online are loki (a cold spare) and hermes
(a drop-in replacement for the existing x345 which serves mail).

Dealing with intermittent problems with the build process on the hudson zone.

Transferred the vmsa vmware instance to a zone on odyne for performance

Created roughly 2 dozen new committer accounts.

Sebastian Bazley continues his work rationalizing foundation records and
authoring supporting scripts.

An issue came up regarding the ability, or lack thereof, of purging sensitive
data from the svn repository.  No action was taken at this time.

Yahoo! has been in touch with us to resume talks about a build farm donation.

Paul continues to work the issue on the incompatible drive trays.

20 Aug 2008 [Paul Querna / Justin]

We have a functional backup system in place, complete
with backups of select files within user home dirs,
thanks to Tony Stevenson, Gavin McDonald, Norman
Maurer, and Roy T. Fielding.

Philip Gollucci was flown down to Fort Lauderdale
to help set up the new colo site at

The two Geronimo build machines, selene and phoebe,
were set up and handed off to the Geronimo PMC.

An experimental LDAP zone was set up to pursue
the idea of deploying LDAP in some capacity at the ASF.

Purchased a SCSI card for thor (build zones). Unfortunately
the existing array failed miserably (both PSU's died).
Currently pursuing a different path of installing drives
in thor (said drives were also purchased from Sun for ~$2000,
but need to be replaced due to incompatible drive trays.)

Wendy Smoak was granted infrastructure karma for her
work on the ASF's maven repository.

The maven snapshot repository was purged of all files older
than 30 days, which created some ripples within the community.
At its largest it was over 90GB, which means it contained
more bits than It currently stands
at 21GB.
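
A purge of this kind can be sketched as follows. This is only an
illustration, not the script the team actually ran: the function name
find_stale_files and the use of file modification time as the age
criterion are assumptions.

```python
import os
import time


def find_stale_files(root, max_age_days=30, now=None):
    """Return paths under `root` whose mtime is older than `max_age_days`.

    Hypothetical helper: the minutes do not say how file age was measured,
    so modification time is assumed here.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Listing the candidates first gives a dry run that can be reviewed (or
announced to the community) before anything is removed with os.remove().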

Henning will work with Geronimo folks to start load monitoring on the new machines provided to their projects.

16 Jul 2008 [Justin Erenkrantz]

Purchased two Dell 1950s for $10K.  The machines have been shipped
to the sysadmin and will be deployed shortly for Geronimo to use
for TCK testing.  We investigated EC2 as an alternative, but found
it wasn't cost-effective for the needs of the Geronimo PMC.

Entered into a monthly agreement with for ~$500 / month
for colocation services in Florida.

Ordered a switch and miscellaneous cables for setting up the colo.

Helped some new zone admins set up shop.

Set up the Sun t5220 as thor, which will be used for build systems
once we have the disk situation sorted out.

Work continues on setting up bia (backups), driven by Tony Stevenson,
Gavin McDonald, and Norman Maurer.

Set up odyne in SARA, which will be used as a zone host.

Made progress with Jason van Zyl regarding the adoption of and
the central repo machine by the ASF.

25 Jun 2008

Appointment of Infrastructure Committee Chair

 WHEREAS, the Board of Directors heretofore has charged the
 President with the responsibility of overseeing the activities
 of the ad-hoc Infrastructure Committee using the President's
 existing authority to enter into contracts and expend foundation
 funds for infrastructure, and

 WHEREAS, the Board of Directors recognizes Paul Querna as the
 appropriate individual to chair the infrastructure committee,
 with respect to executing the board approved infrastructure plan
 binding the Foundation to infrastructure contracts and associated
 financial obligations,

 NOW, THEREFORE, BE IT RESOLVED, that Paul Querna be and hereby is
 appointed to the office of Vice President, Apache Infrastructure,
 to serve in accordance with and subject to the direction of the
 President and the Bylaws of the Foundation until death, resignation,
 retirement, removal or disqualification, or until a successor is
 appointed.

 Special Order 7H, Appointment of Infrastructure Committee Chair,
 was approved by Unanimous Vote of the directors present.

25 Jun 2008 [Justin Erenkrantz]

Addressed the board's suggestion for more descriptive
service names by updating our nagios config and
improving the documentation in dev/machines.html.

Work continues on setting up bia (backups), driven by
Norman Maurer, Tony Stevenson and Gavin McDonald.

Gaea (zones) was determined to have BIOS issues.  As
mentioned in the previous report, it was shutting itself
down without warning.  A ticket was filed with Sun, and
remains open.  The problem hasn't occurred since upgrading
the BIOS, but we are carefully monitoring the machine.

Henk Penning continues to keep a close eye on the operation
of the rsync mirrors and their signed contents.

We purchased a NIC card and 8 GB of RAM.  The NIC card is
insurance against a failing NIC in eris (svn), and the 8 GB
of RAM was divided between gaea and hyperion (zones).

Several zones were created, the tuscany TLP was migrated,
and roughly 30 new accounts were created.

We are working with OSUOSL to get new hermes (mail) online and
a new Sun 5220 racked.

Eos and aurora (websites) had system upgrades performed by Mads
Toftum and Norman Maurer.

Norman Maurer was granted root karma on

Gavin McDonald was granted apmail karma.

The general availability of was announced on

Upayavira volunteered to represent the infra team at an OSS Watch
workshop on profiling open source communities.

21 May 2008 [Justin Erenkrantz]

We purchased 3 Dell 2950s costing about $12K total.

We purchased a certificate for, intended for use as an
svn mirror of our main repos. We are in the final testing stages now, and
expect to bring this machine into community service in the very near
future.

Sun donated 8 SAS drives and we had them shipped to Sander Striker. The
drives will eventually be installed in odyne, the 4150 we cannibalized to
get harmonia up.

Sun has offered to provide us with a support contract for our machines in
SARA, at no charge to the ASF. Details have yet to be finalized.

Old eris has retired itself due to its failing RAID array, and one of the
2950s was pressed into service via trial by fire. A tremendous amount of
effort went into bringing the replacement box online and stabilizing it,
chiefly by Philip Gollucci, Norman Maurer, Paul Querna, and Tony Stevenson.

Work continues on setting up bia, our backup host, driven by Norman Maurer
and Tony Stevenson.

Roy T. Fielding made significant improvements to our ezmlm installation,
eliminating the need for moderation on our commit lists.

Sebastian Bazley has done an incredible amount of work rationalizing
various foundation records.

Roughly 3 dozen new accounts were created this month.

2 TLPs, archiva and cxf, have been migrated.

In light of the recent Debian/Ubuntu security advisory (CVE-2008-0166)
regarding ssh/ssl, we have upgraded the host keys on all of our Ubuntu
hosts, and have scanned all the public keys on We
investigated the ssl certificate on brutus and found that it predates the

4 committers' and 2 members' accounts had their public ssh keys disabled on for failing to comply with a request from root@ to remove
them within 48 hours of being notified. The keys in question were all
detected by the script, and most users (~30 total) who received
the notice dutifully complied.
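
A scan of this kind can be sketched as follows. This is not the script
referred to above; it is a minimal illustration that assumes a blacklist
of known-weak key fingerprints is available as a set of MD5 hex digests,
and that each key line has the plain "type base64 [comment]" form with no
options prefix. The names fingerprint and find_weak_keys are hypothetical.

```python
import base64
import hashlib


def fingerprint(pubkey_line):
    """MD5 hex digest of the key blob in a "type base64 [comment]" line."""
    blob = base64.b64decode(pubkey_line.split()[1])
    return hashlib.md5(blob).hexdigest()


def find_weak_keys(authorized_keys_text, blacklist):
    """Return the key lines whose fingerprint appears in `blacklist`.

    `blacklist` is an assumed set of MD5 hex digests of known-weak keys.
    """
    weak = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            if fingerprint(line) in blacklist:
                weak.append(line)
        except (IndexError, ValueError):
            # Not a "type base64 [comment]" line; skip it.
            continue
    return weak
```

Flagged lines can then be reported to the account owner, who replaces the
key with one generated on a patched system.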

Gaea has started shutting itself down occasionally for reasons unknown. We
are investigating, but so far there's been little information to go on in
the logfiles.

It was noted that a map of host names to services can be found at

Concerns about host names are deferred to the infrastructure team.

16 Apr 2008 [Justin Erenkrantz]

Purchased a 1y silver-level Sun support contract from Technologent
regarding OSUOSL-located Sun equipment.

We had the system board and a DIMM in eos replaced under the aforementioned
support contract.

Work continues on setting up bia, our backup host, mainly driven by Tony
Stevenson and Norman Maurer.  Both of them have been granted apmail karma.

Roughly 2 dozen new account requests were processed.

Sun graciously donated a pair of x4150s for use at SARA.   We have brought
one of them online as harmonia, and will be pressing it into service as an
svn mirror site.

A confluence-backed website was hacked into due to an improper permissions
scheme.  As confluence typically does not provide change notifications, no
record of the event was sent to the affected project's mailing lists. It
was later determined by David Blevins that roughly 30 other confluence
spaces were also misconfigured, further reinforcing the opinion of many
infrastructure members that the confluence installation at the ASF is a
bridge to nowhere.

Wendy Smoak continues her excellent work reviewing changes to the Apache
maven repos on repository@.

Gavin McDonald, Norman Maurer, and Tony Stevenson were all granted full
infrastructure karma.

19 Mar 2008

Pressed new brutus machine into production, shut down old brutus.

Brought bia (the new backup machine) online.  Work continues on
setting up the actual backups, mainly driven by Tony Stevenson
and Norman Maurer.

Experiencing some problems with jira's availability on new
brutus.  Jeff Turner is looking into it.

Migrated bugzilla to 3.0.  Special thanks to Mark Thomas and
Sander Temme for all the hard work that went into that process.

TLP migration for continuum was completed.  Roughly 2 dozen
new account requests were processed.

infrastructure-dev@ was made a public list at Jukka Zitting's
request.

19 Dec 2007

Acquisition strategy through May 2009; bottom line figure is $58,900.

Key features:
 - Replace our aging IBM x345s which are currently the cornerstone of
   our infrastructure - they are nearing 4 years old.
 - Stagger replacements so as not to do it all in one go
 - Gets a SVN mirror in EU
 - Equip a respectable build farm
 - Equip for CMS 'thing-ma-bob' with staging + 2 prod servers

The specifics of the machines may change, but this is an overall plan.

Acquisition strategy:

- OSL: Stay relatively power-neutral for next 4-6 mos; expand after then
- SARA: Expand to 20U 'early next year' (pushing for Feb.)
- x345s: Acquired in late 2003 / hermes prod in Feb. 2004
- Helios: In-service approx. April 2005

 Base configuration: stick with x2200M2 with SATA drives


Base equipment costs [as of 10/28/2007]:

 - x2200 M2 - 1x 2210 / 2GB / No drives: $1619/ea (incl. tax+shipping)
 - x2200 M2 - 2x 2218 / 8GB / No drives: $3871/ea (incl. tax+shipping)
 - x4150 - 1x Intel E5320 (1.86GHz; Quad) / 2GB / No HDD: $3082/ea (incl. t&s)

Machine extras:
 - CPU: AMD Opteron 2210 (1.8GHz): $179/ea from Newegg
   CPU: AMD Opteron 2218 (2.6GHz): $455/ea from Newegg
 - RAM: DDR2 PC2-5300 / CL=5 / Registered / ECC / DDR2-667 / 1.8V
  2GB sticks from Crucial: $135/ea [Buy in pairs]
  8GB  (4x2GB) -> $600 (incl. tax) [$581.81]
  16GB (8x2GB) -> $1200
  32GB (16x2GB)-> $2400

 - Hitachi A7K1000 750GB SATA drive: $230/ea from Newegg
 - Seagate ES.2 1TB SATA drive: $339.99/ea from Newegg
 - Factor $250 for 750GB; $350 for 1TB

Derived cost for manual upgrade of base x2200 config:
 - 2x 2210 / 10GB / 2x 750GB -> $2919/ea ($3000/ea)
 - 2x 2210 / 10GB / 2x 1TB -> $3119/ea ($3200/ea)

2nd-level x2200 config:
 - 2x 2218 / 8GB / 2x 750GB -> $4371/ea ($4400/ea)

3rd-level x2200 config:
 - 2x 2218 / 8GB / 2x 1TB -> $4571/ea ($4600/ea)
 - 2x 2218 / 16GB / 2x 1TB -> $5171/ea ($5200/ea)
 - 2x 2218 / 32GB / 2x 1TB -> $6371/ea ($6400/ea)
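
The 2nd- and 3rd-level figures above can be reproduced from the listed
base price and the planning factors. The sketch below is only a
consistency check with illustrative names; it assumes the 2x 2218 / 8GB
base unit ($3871), the $250 / $350 drive factors, and $600 for each
additional 8GB of RAM.

```python
BASE_2218_8GB = 3871                        # x2200 M2, 2x 2218 / 8GB / no drives
DRIVE_FACTOR = {"750GB": 250, "1TB": 350}   # planning factor per drive
RAM_PER_8GB = 600                           # 4x 2GB sticks


def config_cost(ram_gb, drive, drives=2):
    """Cost of a 2x 2218 x2200 with `ram_gb` of RAM and `drives` SATA drives."""
    extra_ram_blocks = (ram_gb - 8) // 8
    return (BASE_2218_8GB
            + drives * DRIVE_FACTOR[drive]
            + extra_ram_blocks * RAM_PER_8GB)


for ram_gb, drive in [(8, "750GB"), (8, "1TB"), (16, "1TB"), (32, "1TB")]:
    print(f"2x 2218 / {ram_gb}GB / 2x {drive}: ${config_cost(ram_gb, drive)}")
```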

x4150 config:
 - 1x E5320 / 4GB / 6x 750GB -> $4982/ea ($5000/ea)


Helios [Solaris zones]:

 13x 750GB SATA drives = $3250
 Battery backup replacement (Jan-Feb): $450 [370-5545-BAT]

Conditional: Needs correct braces from Sun; ETA next week @ OSL.
Purchase: December
In-service: January
Helios array total: $3700


Brutus [Issues: JIRA/Bugzilla/Confluence]:

x2200 M2: 2x 2218s (4 cores) / 8GB RAM / 2x 1TB SATA / Linux x86_64
 [ Atlassian will support via official Sun x86_64 JVM ]
Purchase: December
In-service: January @ OSL
Expected price: $4600


SVN mirror @ SARA:

x4150: 1 Quad-Core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
Conditional on: x4150 being available with SATA drives
Purchase: Late Dec / Early January
In-service: February @ SARA
Expected price: $5000
 [SARA box needs to be purchased thru Sun .NL; so may be more if in EUR.]


Eris replacement [SVN @ OSL]:

x4150: 1 Quad-Core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
Purchase: March (after SVN mirror @ SARA setup)
In-service: April
Expected price: $5000


Loki [cold-spare @ OSL]:

x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
Purchase: May
In-service: June
Expected price: $3000


Hermes [mail @ OSL]:

x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
Purchase: July
In-service: August-September
Expected price: $3000


Build farm (@ OSL) - stage 1:

(2) x2200 M2: 2x 2210s (4 cores) / 16GB RAM / 2x 1TB SATA / Linux (VMWare Wks)
Purchase: September
In-service: October-November
Expected price: $11,400 (2 @ $5200)


Build farm (@ OSL) - stage 2:

Apple xServe: 2x Dual-Core Intel Xeon / 1GB RAM / 80GB SATA
Purchase: January '09
In-service: February '09
Expected price: $3200 ($2,999 + tax&shipping)


CMS thing-ma-bob's:

Staging @ OSL: x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Prod @ OSL:    x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Prod @ SARA:   x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Purchase: February '09
In-service: March-May '09
Expected price: $20,000 (3x$6400)
 [SARA box needs to be purchased thru Sun .NL; so may be more if in EUR.]
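
As a cross-check, the expected prices listed above add up to the $58,900
bottom line. The snippet below is just that arithmetic; the dictionary
keys are shorthand labels, not names from the plan. Note that build farm
stage 1 uses the stated $11,400, although 2 @ $5200 would come to $10,400.

```python
expected_prices = {
    "Helios array": 3700,
    "Brutus": 4600,
    "SVN mirror @ SARA": 5000,
    "Eris replacement": 5000,
    "Loki": 3000,
    "Hermes": 3000,
    "Build farm stage 1": 11400,   # as stated in the plan
    "Build farm stage 2": 3200,
    "CMS (3 machines)": 20000,     # 3 x $6400, rounded up in the plan
}

total = sum(expected_prices.values())
print(f"Total: ${total:,}")  # Total: $58,900
```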

19 Jul 2006 [Sander Striker]

Approved by General Consent.

27 Jun 2006 [Sander Striker]

Tabled due to time constraints.

15 Mar 2006 [Sander Striker]

Sander provided an overview of the current state of the ASF Infrastructure, as summarized in the President's report above.

Approved by General Consent.

21 Dec 2005 [Sander Striker]

No report received or submitted. Sander to be contacted regarding status.

21 Sep 2005 [Sander Striker]

Sander submitted an oral report, noting that Infrastructure was very busy. Leo Simons sent an Email to committers@ asking for volunteers to help.

Approved by General Consent.

22 Jun 2005


The infrastructure team has been so busy it hurts. We have migrated a few
more projects to top level, migrated a few from cvs to svn, added some new
infrastructure projects, users, "the usual". We are seeing roughly 600
emails a month on the infrastructure mailing list, excluding svn commit
messages.

Besides the usual, some things of note include that

* we have slowly gotten to work on the board's request to formulate RFPs
 on paid staff/outsourcing.

* OSU OSL has generously offered to provide us some hardware along with
 hosting at their colo.

* we have set up a new mailing list,, which is tasked
 with figuring out a flexible and powerful new website publishing

* we have solicited volunteers which has yielded roughly half a dozen
 responses from previously silent people as well as somewhat inactive
 infrastructure participants offering to become more active.

* we have found we are not currently in optimal shape for productively
 growing our team and getting people things to do. Work is in progress
 to improve that.

* quite a bit of work has been done and is still in progress on internal and
 project-oriented documentation.

* we have some promising submissions to the google summer of code programme
 for helping with the asf-workflow tool.

* we are planning another infra-thon (infrastructure team get-together) in
 the weekend leading up to apachecon which should be considerably more
 modest and hence cheaper than the last one.

* our mailserver has had a lot of trouble handling the load (mostly
 spam) lately. Solutions being explored include optimizing the
 machine's configuration, patching the mail software to be more efficient,
 taking into operation a new machine at OSU OSL, and generally anything
 else we can think of. It has taken a lot of time just to keep things
 running, and we have seen some service interruptions. We may need more
 hardware if the amount of spam that goes around the web continues to grow.

* brutus has been partially reinstalled to serve as another FreeBSD host,
 which hopefully will be finished this week. We will use it to take over
 some of the duties of minotaur (our main box) as we upgrade that.

* AMD has donated a machine for running which we will be
 hosting in our rack in the San Francisco colo. The machine is scheduled
 for install this week.

* loki (running vmware) has been partially configured for gump runs.

* we have another machine on loan at OSU OSL for gump runs which is not
 in operation yet.

* quite a few (dozen??) zones have been set up on the new Sun box,
 helios. Various PMCs have been busy setting a variety of services up.
 Helios is still in a testing phase.

* we are hoping to bring the RAID array that came with helios into
 operation this week.

* we experienced a lot of performance problems with the wiki which were
 solved by putting in place a development install of apache httpd with

* JIRA has been upgraded to the most recent release. It is still being
 tuned to be as stable as it was before. Users are seeing a notable
 performance increase and fancy new features such as automated links to
 subversion changes.

* Our SVN service had a few hiccups, but the incident rate seems to be
 decreasing whereas usage is still increasing. The majority of our projects are
 now using SVN, with more projects in the migration queue.

* the certificate service has seen a lot of development
 work recently.

* we have moved DNS registrars. We are now with dotster??

* Serge's Nagios install has been moved to and
 reconfigured to provide even more useful information.

* We have purchased and installed another PDU at our main colo in SF.

30 Mar 2005 [Sander Striker]

The Infrastructure Team has gathered for an Infra-thon from
March 18th (Fri) til March 22nd (Tue).  A travel/lodging budget of
roughly $3500 was spent.  The final number is not available at this
time.

All machines based at Mission St. have been moved to Paul, the
new location of the UnitedLayer colo.  Three passes have been
requested for physical entry to the colo, for: Brian Behlendorf,
Scott Sanders and Sander Temme.

Some issues have arisen during the infrathon, but none have
caused severe downtime or unacceptable discomfort to our projects.

Due to an error on the part of the team we have now roughly figured
out how much bandwidth we would use without the use of mirrors;
roughly 70Mbps.

The power distributors bought as approved by the board have not
been used.  It turned out that we can actually use the 0U PDU
that we already have, and, there is room for a second one.
We'll have to return the unused PDUs; Mirapath is working on
a quote for a new 0U PDU.

We are planning on moving all shell accounts to elsewhere
at some point.  This will be hosted under
DNS has already been updated, meaning our committers have to use to log into their shell.

Also, we managed to get hermes (our primary mailserver) failing
every 7-8 hours.  This problem seems to have disappeared after
a BIOS update and/or a kernel update.

Brutus, the current Gump machine, can remain under supervision of the
Gump PMC until we are done setting up the new Sun server, helios, and
loki, our machine purposed for running VMWare.  After that, brutus
will most likely become our secondary mailserver.  The plan was for it
to be shipped to the Netherlands, where it could do its work and serve
as a fail-over in case of failure of hermes.  However, due to external
conditions that are applicable to the IBM machines we cannot ship any
of them out of the US at this point.

Minotaur, our machine usually serving our websites, subversion and
shell accounts was relieved of serving the websites prior to the colo
move.  We've left this to be the case because we wish to upgrade
the OS on that machine in the near future.  We are synchronizing
content to ajax every 4 hours.  We were planning to upgrade minotaur
to FreeBSD 5.3.  Unfortunately there were some setbacks that
prevented us from doing a backup.  We weren't feeling very lucky,
and have decided to put off the upgrade til a later point in time.

Ajax, our european hosted machine, is currently hosting most
of the websites (with the exception of tcl and perl), as well as
the wiki, jira and bugzilla.  We've switched off the indexing by
search engines for the wiki and the issue trackers since the load
on the machine was insanely high.
Ajax too is scheduled for an OS upgrade, due to the fact that it
is stuck in IO wait half the time.  We will however not be doing
this while most of our infrastructure is hosted by that machine.
It has held up pretty nicely, even when we at some point were not
using the mirrors and it was handling most of our traffic (we
peaked at 50Mbps).

Loki is going to be our machine hosting VMWare.  We've added 2GB
of memory and 2 36GB disks from one of the other machines, giving
it enough beef to actually run the ESX installation.  Its primary
purpose is going to be to host various OS instances for Gump to
run on.

Helios is our new Sun v40z.  It came with a 1TB StorEdge RAID array.
Unfortunately it was delivered without an OS installed, no install
media, and, the FibreChannel card to connect the StorEdge to the
machine was missing.  This will all be resolved, but for now we
are not able to swiftly put this box into production.  However,
when put in production this machine will host several different
so called zones.  One of these zones will be,
which will host all current shell accounts.  Every PMC will most
likely also get their own zone, which it can use as a testing
ground/showcase for their own software.  Finally Gump will be
given one or two zones for their runs.

Eris, our final machine, is stripped down to just the chassis
pending hardware to refit it.

The wiki has seen an upgrade, which was quite an undertaking.

Preparation was done for the Bugzilla upgrade.  The final
upgrade will be done at a later stage.

Work is progressing on the Eyebrowse to mod_mbox migration,
preserving the Eyebrowse urls.

Subversion is doing fairly well.  We've seen a number of repository
wedges, which seems to have nothing to do with the core
functionality of Subversion, but rather has to do with the add-on
functionality provided by ViewCVS.  We are aiming to cure the
symptoms, since we have not been able to pinpoint the cause, by
moving our Subversion backend from fs_bdb to fsfs.

All in all the infrathon has proven to be a valuable experiment.
For the future I personally would consider limiting infrathons to
software work only, deferring all work involving hardware to the
locals.  The high bandwidth communication as well as the near to
full time availability of the entire team has proven to be
incredibly useful.
That said, the focus for our services will have to shift at some
point to integration and coupling as currently we are growing more
and more islands that in consequence require a relatively large
support team.  Reducing complexity for our users as well as the
Infrastructure team is definitely something to consider.

Infrastructure Team report approved as submitted by general consent.

15 Dec 2004 [Infrastructure team]

We're transitioning from nagoya to ajax.  We've submitted the final H/W list to
Sam for IBM, and are waiting to hear back from him.  We're going to proceed
with the purchase of a UPS for our co-lo rack shortly.

We are also coordinating an infrastructure get-together in SF for Q1 2005 so
that we can address the pressing large-scale items on our plate with as many
people in the same room as possible and near our servers to coordinate server
hardware and software upgrades.  Financial assistance to help bring the
participants together is desired.

Apache Infrastructure report approved as submitted by general consent.

14 Nov 2004 [Sander]

 Infrastructure benefits a lot from the face to face meetings at
 ApacheCon.  Work is getting done.

 David Reid and Ben Laurie have been working on the CA and things
 are looking very promising.  We will be able to offload all the
 adding and removing people from groups.  This will require that
 all services are exposed via HTTP, so there is a lot of work that
 needs to be done.  With the moving of several projects to SVN,
 this goal will be easier to reach, given that shell accounts
 won't be needed anymore to do actual development.

 People do actually volunteer, but it is hard to actually parallelize
 a lot of the tasks given the centralized knowledge.

 Services are being moved around and off nagoya, since the machine
 is being retired.

 Infrastructure is budgeting $5k for the acquisition of a UPS
 for the US colo.

22 Sep 2004 [Sander]

 Sander reported that Infrastructure is finding itself
 steadily overworked, as well as there being confusion over
 how much authority Infrastructure has; he pointed to the
 mirroring policy as a prime example.  He was happy to
 report that additional volunteers have joined, especially

18 Aug 2004 [Sander]

 The Infrastructure team is battling the same problems as
 reported before.  Infrastructure needs help to get things done.
 The root rotation doesn't get filled.  An obvious way out
 is hiring a (part-time) sysadmin, so that the team can focus
 on getting automation tools developed making the job less

 That said, Ken Coar offered help in writing mailing list management
 tools.  Also Geir offered to help out with the mailing lists.
 And we are happy to announce that Berin Lautenbach has joined the
 group of roots.

 Hardware-wise we gained a switch contributed by Theo van Dinter.
 We are waiting for that to arrive.

 Ajax still doesn't have console access, nor do we have the
 accounts on the power switches to power cycle the box.  SurfNet
 also asked us to work out the reverse DNS on our end, which we
 haven't gotten around to yet.

 DNS is under the Infrastructure's team control again, and we
 are happy to announce that we added two more secondaries, making
 our DNS servers a bit more globally spread.

 A lot of work has been done getting VMware instances up and running
 and this seems to pay off.  Investigation in applicability is ongoing.

 Services hosted on nagoya will be moved off to other boxes, given
 the current (in)stability of some of the services on nagoya.

 eris, one of the new IBM x345s, seems to be having some problems.
 Investigation is ongoing.

21 Jul 2004

 One issue - one of the new IBM boxes not functioning correctly,
 and budget is needed for switch and UPS, but still working through
 so no current action items.

21 Apr 2004 [Sander]

 Sander gave a verbal infrastructure report.  Recent events have
 included minotaur disk problems; new disks have been sourced.
 The new machines are now in the racks and are currently being
 tested.  There was some discussion with United Layer over
 power consumption, but this has been resolved without requiring
 any action.

21 May 2003

[ from Brian Behlendorf ]

Major efforts:

New ASF server was paid for by ASF check and picked up by Brian.  Brian to
 set up an in-person meeting for the local ops team to check out the box,
 learn how the RAID works, etc.  To be scheduled for some time next week
 most likely.

The colocation agreement with United Layer was signed.  I'll send a copy
 to Jim for our records.  We can move in any time, billing starts a few
 weeks after we move in our first box.  First move might happen this week
 if I get my act together (Brian speaking).

Other work done by the team:

Created 22 accounts: jesse, antoine, andreas, alexmcl, egli, gregor,
 michi, edith, felix, liyanage, memo, thorsten, minchau, sterling,
 psmith, sdeboy, brudav, michal, cchew, yoavs, funkman, joerg
Removed 1 account: mehran
Updated the account creation template in the infrastructure repository
Dealt with a couple out-of-disk-space situations
Dealt with some runaway robots
Discussed whether to change the commit-mailer to only send viewcvs URLs
 if the commit message would otherwise be too big.
Updated the Subversion installation on icarus
Fixed a content-encoding issue on the web server, and turned off SSI

Also of note:

A demo of SourceCast for has been set up, and I've shared
 access to it for a few folks.  Most of them are busy, though, so if
 others would like to give it a try, write me privately.  I'll be putting
 together a true plan for ASF evaluation over time, this is just an
 initial kick-the-tires kind of thing.

30 Oct 2002

Establish an infrastructure board committee

    The following resolution was proposed:

    WHEREAS, the Board of Directors deems it to be in the best
    interests of the Foundation and consistent with the
    Foundation's purpose to establish an ASF Board Committee
    charged with maintaining the general computing
    infrastructure of the ASF.

    NOW, THEREFORE, BE IT RESOLVED, that an ASF Board Committee,
    to be known as the "Apache Infrastructure Team", be and
    hereby is established pursuant to Bylaws of the Foundation;
    and be it further

    RESOLVED, that the Apache Infrastructure Team be and hereby
    is responsible for creating and upholding the computing
    policy for the Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is charged
    with managing and maintaining the infrastructure resources
    of the Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is charged
    with accepting infrastructure resource donations to the
    Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is responsible
    for handling communication and coordination in relation to
    infrastructural issues; and be it further

    RESOLVED, that the persons listed immediately below be and
    hereby are appointed to serve as the initial members of the
    Apache Infrastructure Team:

      Brian Behlendorf (chair)
      Justin Erenkrantz
      Pier Paolo Fumagalli
      Ask Bjoern Hansen
      Aram Mirzadeh
      Steven Noels
      David Reid
      Sander Striker

 Discussion on this resolution focused around the need for such
 a Board Committee. Roy Fielding noted that such a committee
 might be best handled as a President's Committee since the
 President, rather than the Board, is in charge of operational
 aspects of the ASF. It was further discussed that such a team
 would be a good idea to create a focal point for long term
 initiatives, as a contact point, and to create a sense of
 empowerment for the people interested in the technical
 infrastructure of the ASF.

 By general consent, this resolution was tabled, with a
 recommendation to the President to establish a President's
 Committee with the same goals and responsibilities.