
This was extracted (@ 2017-11-16 00:10) from a list of minutes which have been approved by the Board.
Please note: the Board typically approves the minutes of the previous meeting at the beginning of every Board meeting; therefore, the list below does not normally contain details from the minutes of the most recent Board meeting.


Infrastructure

18 Oct 2017

General
=======
Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Finances
========
Following on from last month, Infrastructure's finances are in good
shape, and running under budget on many of our cost centers. However,
we still have one open headcount with an unknown impact. We will
continue to manage against the budget.

Short Term Priorities
=====================
- Schedule Jira and Confluence upgrades
- Complete our LDAP system upgrade from master/master/many-slaves
 to a simpler master/few-slaves layout

General Activity
================
- Major Jenkins upgrade to the latest LTS release was completed.
- Rebuilt JIRA database, in preparation for its next major upgrade.
 The upgrade date is still unscheduled, as we need to do some
 additional test runs.

Uptime Statistics
=================
No significant, unscheduled downtime occurred during the past
reporting month. Our systems continue to run with high availability.

20 Sep 2017

General
=======
Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Finances
========
There appear to be some strange numbers in this month's financials,
so some further analysis will be done over the next couple of weeks. This
is likely a simple matter of misapplied cost centers. Will provide an
update next month.

Short Term Priorities
=====================
- Upgrade Confluence, Jira, and Jenkins. We will schedule these for
 weekends, as several hours of downtime will be required for each.

Long Range Priorities
=====================
- First priority: shift towards puppet-managed cloud-based services.
- Various automation, which likely will involve working with the
 Whimsy folks to loop this into their product.

General Activity
================
- Seven machines de-racked at OSU/OSL, as part of our mission to move
 from ASF-owned hardware to cloud-based infrastructure
- Several disks at OSL were reporting a predictive failure, so we
 ordered a few more and are working with OSL to get them installed
 and looped back into the RAID system. Even though these systems are
 on a path to decommission, a disk failure would have been much more
 expensive than the cost of the replacement.
- Towards our goal to decommission minotaur, mbox-vm has been created
 to function as our official archive of all mail content/archives.
 Minotaur is serving this function today, with mail-search.a.o and
 mail-archive.a.o slaving their content from minotaur. mbox-vm is
 still a work in progress.

Uptime Statistics
=================
The Infra team has continued its excellent uptime record. We had no
significant downtime on any system this past month.

16 Aug 2017

General
=======
No Board-level issues at this time.

Finances
========
The InfraAdmin has been working with our Accounting team to clear up a
workflow and invoice issue with one of our vendors. Our vendor will
now send invoices directly to Infra for review and approval.

Operations Action Items
=======================
We have a service contract with Symantec for signing binary releases.
That contract expired in June, and we are working to re-up signings
for the next year. This has been hampered because we need to
"authenticate" our business according to new rules from the CA/Browser
Forum. We are also switching the primary contact point to the
InfraAdmin and our VP Infra email address, to provide backstop points
on contract/renewal issues. The authentication process is involved,
but is progressing. With some hope, it will be solved by end of
August.  This service is primarily used by our Apache OpenMeetings and
Apache Tomcat PMCs.

Per above, our D&B records reflect our new Wakefield postal address,
but many other third-party records are out of date. These need to be
updated to ease verification of our business.

Short Term Priorities
=====================
- Complete our LDAP schema changes, and server layout.
 - The schema changes have been performed in conjunction with the
   Apache Whimsy development team to provide tooling, and design
   thoughts on our updated schema. The goal is to provide a unified
   view of Incubator podlings and TLPs.
 - Our LDAP servers are configured as a multiple-master system with
   multiple slave replicas. We will be winding this down to a simpler
   layout with a single master and a few replicas.
- Expanding gitbox usage to more projects.
- Improved/automated monitoring via DataDog.

Long Range Priorities
=====================
- Moving all projects over to gitbox. We are still discovering some
 edge cases in our tooling, so the mass-migration is not "now".
- Upgrades to our Confluence and Jira installations, along with moving
 them to use Atlassian's Crowd product for single-sign-on.
- Revamp of our DNS management; see below.

General Activity
================
Over the past few weeks, we've done a lot of work with the domains
that we manage. As domains are coming up for renewal, we've been
moving them to Namecheap. However, this process will likely accelerate
as Namecheap provides API-based facilities that can help our DNS
management. After some testing, it appears that we can defer/diminish
our DNS work by shifting those functions over to Namecheap.

Lots of work has been done to ensure that our historical/archival
mailing list content is correct. Thanks to Sebb and Gavin for the
grunt work to make this happen. In addition, the team is now reviewing
what remains to declare lists.apache.org as the only service for
archive access (and turning off mail-search.a.o and mail-archive.a.o).

19 Jul 2017

General
=======
Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

Operations Action Items
=======================
- Continue discussions with the Whimsy community on Foundation
 workflows, and how Infra can support their work (more below)

Short Term Priorities
=====================
- Fill the open headcount. The past few months have been an evaluation
 of the team with five FTEs, and it is (now) clear that our workload
 demands all six allocated/budgeted positions to be filled.

Long Range Priorities
=====================
- We continue to reduce technical debt by puppetizing services (which
 reduces manual configuration) and moving to cloud-based hosts and
 VMs (reducing reliance on our own hardware, and offering flexibility)
- Automation of Infra tasks is on our long-range planning, but has
 been placed at a lower priority relative to puppetizing services and
 migration to cloud-based VMs. However, the Whimsy community has
 engaged with the Infra community to add various tooling/services to
 their tool. Several workflow improvements have occurred, with more
 planned and/or waiting to be rolled out.

Jenkins Upgrade
===============
Jenkins was migrated to a new VM and VM host on Saturday, July 15.  At
the same time, Jenkins was upgraded to the latest 2.60.1. In
preparation, a new jenkins_asf Puppet module and YAML had been
written, and a builds-test instance was soaked in, which allowed
tweaks to the Puppet config.

A fair amount of downtime (over 10 hours) was due to a final rsync of
data after turning off Jenkins, the installation of upgraded plugins
before the overall upgrade, and another round of plugin upgrades after
the upgrade (for those plugins not compatible until the new version
was installed). All nodes also had to be updated to use a JDK1.8
connection. This has consequences for projects that use Maven-type
jobs and want to build with JDK1.7 and earlier, but this has been
discussed and workarounds noted.

The builds@apache.org mailing list was notified many weeks ago about the
upcoming upgrade and migration, and again shortly before the upgrade
happened. During the upgrade, Twitter, builds@ and operations@ were
also kept in the loop every few hours until completion.

At this stage, builds.a.o is operating smoothly with just a few
non-infra owned nodes needing to reconnect. We may bump the VM memory
after a few days of stats collecting, but we'll see.

In terms of technical debt, this allows us to retire physical hardware
(crius) which is located at OSU/OSL. Our new Jenkins Puppet Module
also gives us much more flexibility to move, restore, and otherwise
manage the Jenkins system.

21 Jun 2017

General
=======
Infra continues to operate as expected, and there are no issues for
the President or the Board at this time.

The Infra team was able to meet last month during and after the
ApacheCon in Miami. This was a chance for the entire team to get
together to discuss work and for team bonding.

Short Term Priorities
=====================
- Discuss and possibly perform an additional Confluence upgrade to a
 more recent version
- LDAP changes to better support project membership management, and to
 support Atlassian Crowd (a single sign-on system for their products)

Long Range Priorities
=====================
- Continue retiring ASF-owned hardware, reduce technical debt, and
 update documentation/runbooks

Uptime Statistics
=================
- Confluence was taken down to perform an emergency upgrade after we
 learned of a critical CVE in the version that we are running.
- Jira was taken down for about two hours to perform a planned upgrade
- We maintained our SLA requirements, even with the upgrades

17 May 2017

General
=======
Infra is operating normally. Our main focus continues on retiring
technical debt.

Short Term Priorities
=====================
- Hire new person to fill the open position
- Team meetup at ACNA 2017 in Miami

Long Range Priorities
=====================
- Finish puppetizing all services
- Decommission all ASF-owned hardware at OSU/OSL

General Activities
==================
Migration from OSU-hosted VMs is still progressing. We've graduated two new
TLPs in the last two weeks (CarbonData, Fineract).

Uptime Statistics
=================
Overall, uptime met our SLA. We had some issues with web crawlers
abusing our services, which led to us having to block an entire /16
block from Digital Ocean. We are working with them to sort this out.

Our mail-archive.apache.org service (run on eos at OSU) had a hard drive
failure and was down for a few hours till we had the disk replaced. The
service is now up and running again.

Community Growth
================
This period, we've had one new contributor to our puppet repository,
as well as contributions from 9 people who contribute on a regular
basis. On the JIRA side of things, we had 23 new people interacting
with infra via JIRA, making it 23 new users, 102 regulars (people that
have contributed before and do so often), as well as 8 'returnees'
(people that have been absent for >2 years but are now contributing/
reporting again).

On a 3 month view, we've had code contributions from 19 people, of
which 13 were regular contributors and 9 were new. 73 people have worked
on or reported a JIRA ticket for the first time, while the other 184 who
worked on or reported issues had done so before.

Cost-per-Project Reduction
==========================
Work has been done on estimating the cost-per-project for
Infrastructure. See https://status.apache.org/costs/ for details.
This is still at an early stage, and thus the figures are not
entirely accurate yet. We are working towards better managing
the documentation on time spent per project/component.

19 Apr 2017

General
=======
Infra is operating normally, although we are sad to see one of our
teammates depart. Our main focus continues on retiring technical debt.


Finances
========
March actuals show us running about $11k over for the FY17 year. The
FY18 budget is based on recent actuals, plus increments that were not
budgeted in prior years. In particular, these are costs related to
staffing and increased cloud costs due to shifting away from ASF-owned
hardware in the Open Source Labs at OSU. We continue to balance our
cloud leasing primarily between LeaseWeb, AWS, Online.Net, Hetzner,
and some smaller footprints elsewhere. Each provider has individual
benefits for the type of service that we are running.


Short Term Priorities
=====================
- Hire new person to fill the open position
- Team meetup at ACNA 2017 in Miami


Long Range Priorities
=====================
- Finish puppetizing all services
- Decommission all ASF-owned hardware at OSU/OSL


Uptime Statistics
=================
Our server running the Jenkins master (crius) experienced a hard drive
failure in its RAID array. We lost no data, but the server did become
non-responsive while the array needed to be examined. "Hands" at OSL
replaced the drive with a spare that we had on-site, and a couple days
later the RAID array was back in full operation (the array is under
heavy I/O load, so the rebuild took a surprising amount of time).

We are now looking into moving the Jenkins master to cloud hardware
sooner rather than later. The hardware failure is a perfect example of
why we want to move away from our own hardware.

15 Mar 2017

Highlights
==========
No Board or Executive action is requested at this time. Infrastructure
is operating normally, without issue.


Finances
========
The Infrastructure team is working within the FY17 budget set by the
Board in December 2016. Planning for Fiscal Year 2018 has begun, based
on the five-year budget outlook prepared earlier.


General Activity
================
The Infra team has begun booking travel and hotel for ApacheCon in
Miami, in May. We will be using the conference for team education, for
interaction with the community and volunteers, and for team building
and face-to-face meetings.



GitHub as Master ("Gitbox")
===========================
Our work on enabling GitHub as the primary focal point for development
continues. As we add new projects, we've found/added several improvements
to our recording of provenance.


Cost-per-Project Reduction
==========================
Our work on reducing our per-project costs continues to be a second
priority, relative to our primary work of VM/machine migrations and
ramping-up of gitbox.

27 Feb 2017

Recent Issues
=============
Nothing is needed from the President, or the Board. These are reported
as an FYI only.

We have seen issues regarding SHA-1 vulnerabilities, supporting the
Apache Maven project, and changes in our build systems. Details are
provided below, in the "General Activity" section.


Finances
========
We spent $525 to purchase several months of online training. This is
an unbudgeted amount, but we believe its unlimited coursework for the
entire team, for three months, was worth the experiment. At the end of
the period, we will evaluate whether an extension is warranted.  Costs
for ongoing staff education will be included into our next budget
request.


Short Term Priorities
=====================
- Decommission ASF-owned hardware at an accelerated rate, in favor of
 cloud-provided servers
- Continue ramping up the Gitbox service
- Balance our datacenter usage for cost efficiency
- Gitbox/Jira integration


Long Range Priorities
=====================
- As reported before: continued migrations of our legacy servers and
 services into new puppet-based services that we can efficiently
 deploy to cost-effective cloud providers.
- Automation to reduce the incremental cost of regular Infra tasks
- Migration from Puppet V3 to $nextgen system for providing services


General Activity
================
We set up a new web area for the Directors to create an authoritative
set of pages for Board-approved policies and commentary. Some initial
work by Directors has populated some data/pages, but the site design
and content is still in its infancy. The plumbing appears to work, so
we're "done" and will follow with continued support.

There has been a lot of Internet discussion about Google and CWI
finding and publishing a SHA-1 collision, and their statement that it
is now possible to construct additional collisions. They will be
releasing further data in a few months.

From our initial analysis, this issue only affects our Subversion
services as a limited denial of service, instituted by an Apache
committer (NOT by a third-party attacker). The Apache Subversion
community has been discussing and analyzing the issue, including the
extent of the problem and appropriate mitigations. We have already
deployed a script developed by their community, to prevent a committer
(or a compromised account) from pushing either of the collision
documents into our repository.
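
As a rough illustration of the kind of check described above, the
sketch below rejects any file in a commit whose SHA-1 digest appears on
a blocklist. This is NOT the script developed by the Apache Subversion
community; the function names, the blocklist, and the sample data are
all hypothetical and exist only for illustration.

```python
from __future__ import annotations

import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def rejected_files(files: dict[str, bytes], blocked: set[str]) -> list[str]:
    """Return the sorted paths whose content hashes to a blocked digest."""
    return sorted(
        path for path, content in files.items() if sha1_hex(content) in blocked
    )


# Hypothetical stand-in for a published collision document. A real hook
# would list the digests of the actual published files instead.
known_bad = b"stand-in for a published collision document"
blocked = {sha1_hex(known_bad)}

# Hypothetical commit: one harmless file and one blocked file.
commit = {"docs/readme.txt": b"hello", "evil.pdf": known_bad}

print(rejected_files(commit, blocked))  # ['evil.pdf']
```

A pre-commit hook built along these lines would refuse the commit
whenever the returned list is non-empty.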

We have confirmed that our website certificates DO NOT use the SHA-1
algorithm. This has been in place for quite a while.

This past month, we discovered that we cannot support the growth of
Apache Maven's private copy of the Maven Central repository. We
previously offered the PMC a VM to keep a copy (should M.C go dark,
we'd retain all necessary data for the ecosystem), along with CPU to
perform analysis against that copy, but looking at the storage growth
rate, we determined that this offering was not sustainable within the
current Infrastructure budget. We notified the Apache Maven PMC that
we needed to retract the offering, and for them to seek specific
budget support from the Board for their needs.

Over the past year, the Infrastructure Team has moved to a policy of
"Ubuntu Only" for our machines, to lower our costs. In the past, we
had a lot of time to support multiple operating systems, services, and
customized software deployments. With the rapid growth of the ASF, and
the resulting demand for Infra support, we have pulled back on the
edge cases to focus more strongly on ROI of our staff's work
effort. That has resulted in the Ubuntu policy, which then resulted in
the decommissioning of Solaris, Mac OS, and FreeBSD build slaves in
our buildbot service. Needless to say, that has raised concern within
several projects that relied on the availability of those platforms.
We have made it clear that the Infra Team will integrate third-party
custom build slaves into our system, so that projects can use those
slaves for their non-Ubuntu builds.


Uptime Statistics
=================
Overall uptime reached 'three nines' with 99.9% uptime. The 'worst
offenders' this period were writeable git repositories (due to a TLS
bug) and Jenkins, though none of the critical or core services went
below 99%.
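
To put those percentages in concrete terms, the sketch below converts
an uptime percentage into the downtime it permits. The 30-day
reporting period and the function name are assumptions made for
illustration; the report does not state the exact length of the
period.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# 99.9% is the overall figure; 99% is the floor for critical/core services.
for pct in (99.9, 99.0):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% over 30 days -> {minutes:.1f} minutes of downtime")
```

Under these assumptions, 99.9% corresponds to roughly 43 minutes of
downtime in the period, and 99% to a little over seven hours.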

For more information, please see http://status.apache.org/sla/


Community Growth
================
This period, we've had one new contributor to our puppet repository,
Jitendra Pandey, as well as contributions from 9 people who contribute
on a regular basis. On the JIRA side of things, we had 29 new people
interacting with infra via JIRA, making it 29 new users, 93 regulars
(people that have contributed before and do so often), as well as 5
'returnees' (people that have been absent for >2 years but are now
contributing/reporting again).

On a 3 month view, we've now had code contributions from 22 people, of
which 13 were regular contributors and 9 were new. 94 people have worked
on or reported a JIRA ticket for the first time, while the other 187 who
worked on or reported issues had done so before.


GitHub as Master ("Gitbox")
===========================
Our Git services are planned to land on gitbox.a.o, so we generally
refer to this as "gitbox". In the past month, we started moving the
OpenWhisk podling over to the gitbox service. That has been a very
slow move, so Tika and Nutch have recently been added to gitbox. It is
too soon to remark on problems and SLA for these communities using
GitHub as their master/primary focal point of development. These
communities have enough activity to help us surface and pinpoint
problems. We've made several improvements based on feedback, and still
need to implement some Jira integrations.

We will likely add more projects before the next Board meeting, and
will report on such additions.


Cost-per-Project Reduction
==========================
An important, needed clarification arose this past month, regarding
the definition of this effort. Reducing the cost-per-project is about
managing the marginal/incremental cost each time a project is
introduced to the ASF infrastructure. This effort is not about the
*overall* Infrastructure bottom line (eg. staffing, training, travel
costs) as those costs have *very* tenuous connections to the
incremental cost of a new project.

There is certainly a mild connection to hardware/hosting costs, as we
offer VM services to projects. Those VMs create a very real cost to
the ASF, and we are in-process on a way to track and allocate those costs.

The more direct costs appear to be related to the work that
Infrastructure performs when a podling is accepted, and when a podling
graduates. These events create a lot of work around managing mailing
lists, repositories, Jira, wikis, etc. These are the incremental costs
that we hope to reduce, through automation, once we are done with the
higher-priority work of VM migrations.

18 Jan 2017

Finances & Operations
=====================
We've been working with Virtual to deal with a process improvement
that results in better privacy protections for HR-related data
for our contractors and employees.


Short Term Priorities
=====================
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our
 supported services (eg. Sonar and Roller)


Long Range Priorities
=====================
- Move all services off ASF-owned hardware, including the difficult
 process of moving our email infrastructure
- Finish the use of puppet for all services, then explore upgrading
 to Puppet 4, and/or containers for ongoing service management


General Activity
================
- Tightened/streamlined git sync processes (see paragraph below)
- Ongoing conversation and development plans for integrating podling
 management into LDAP (and other ASF tooling; particularly, gitbox)
- Finalizing launching Fisheye services locally at the
 ASF to replace the third-party service run by Atlassian.
- Restructured the git repository request service (reporeq.apache.org)
 to better handle podlings, in particular assist them in setting the
 right name and notification lists.
- We suffered a catastrophic hardware failure on the physical machine
 that was hosting the application side of Jira. The service was
 fully-puppetized and relocated to a VPS. More details will be
 published after a post-mortem, during the week of the 16th.


Uptime Statistics
=================
Uptime for this month has been around 99.6% overall. Some issues with
git-wip running out of memory at times have pushed its uptime
down to around 98%. We are investigating the issue. blogs.apache.org
moved to a new host, which also caused a small amount of downtime.
Jira going down hard has not helped.

For more details, please visit: http://status.apache.org/sla/


GitHub as Master
================
The GitBox project is pending responses from the pilot projects before
it can continue. We are looking at multiple potential candidates at the
moment for this. Rather than wait on external groups, we will be using
Infrastructure's own website repository for our testing.


Git mirror/GitHub sync process
==============================
The sync process between Subversion and Writeable git repositories to
git.apache.org and onwards to GitHub has been suffering from missed
syncs lately, at an approximate rate of 1 miss out of every 10 hits. The
process has been improved and the logging also widened, so we can better
analyze any failures that may occur. The upgrades appear to have cut
down on the missed syncs by a large factor. We have, as of this writing,
had 2 missed syncs compared to the 100 or so we usually have, and both
seem to be attributed to timeouts pushing to github. The error rate
going from git-wip to the pubsub system has been reduced from around
10-20 errors per day to 0 by refactoring the pubsub agent. We continue
to monitor the situation and address any bugs that may show up.
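
The "large factor" mentioned above can be made explicit with a small
calculation. This is a sketch only: the helper name is hypothetical,
and the figure of 100 is the approximate count quoted in the text, not
a measured value.

```python
def reduction_factor(before: int, after: int) -> float:
    """How many times smaller `after` is than `before` (inf if after is 0)."""
    if after == 0:
        return float("inf")
    return before / after


# Missed syncs: "100 or so" previously versus 2 after the improvements.
print(reduction_factor(100, 2))  # 50.0

# Pubsub errors per day: 10 (low end of the quoted range) down to 0.
print(reduction_factor(10, 0))  # inf
```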


Community Growth 2016
=====================
Since this is a new year, it might be worth looking back at 2016:
- We had 34 people contributing to our codebase (puppet) for the first
 time in their involvement with the ASF, compared to 12 people who
 regularly contribute to the repo.
- 354 new people have filed an issue with infra for the first time,
 while 329 people, who regularly work on issues, have also been
 contributing to the 2,321 issues created in 2016.
- 9 people who were previously active on JIRA have now started
 contributing code (patches etc) to Infra.


'Costs per project' Project
===========================
Unfortunately, I've been a bit swamped with other things
and this hasn't gotten many cycles. Expect more on this
issue next month.

21 Dec 2016

Finances
========
This past month has seen lots of activity in reviewing the FY17
"actuals" and projecting our overall FY17 expenditures, within the
Infrastructure section of the budget. These numbers have been
coordinated with the President and with Virtual, and are presented
elsewhere in the December Board agenda.

In short, Infrastructure is forecasting increases in staffing costs,
cloud services, and a small amount related to a once/year gathering of
the team at ApacheCon. On the other side, we're looking at lower
hardware costs as we transition from ASF-owned machines towards a more
flexible posture using virtual private servers in our cloud providers.


Short Term Priorities
=====================
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our
 supported services (eg. Sonar and Roller)


Long Range Priorities
=====================
- Move all services off ASF-owned hardware, including the difficult
 process of moving our email infrastructure
- Finish the use of puppet for all services, then explore upgrading
 to Puppet 4, and/or containers for ongoing service management


General Activity
================
- Finalizing the (emergency) move of the moin wiki
- Continued gitbox work (see separate section)
- Ongoing conversation and development plans for integrating podling
 management into LDAP (and other ASF tooling; particularly, gitbox)
- SonarQube puppetized, LDAP-enabled, and brought up. This was a big
 VM move, and will enable self-service of sonar jobs
- Continued work on puppetizing Roller (blogs.a.o) and moving that
 VM. Testing is now beginning.
- Continued puppet work and VM moves
- Beginning investigation of launching Fisheye services locally at the
 ASF to replace the third-party service run by Atlassian. Numerous
 projects use the service, but it will be shut down in early
 January. We're costing out a local replacement.
- The security-vm is in testing, with a new Jira instance. This Jira
 will be used (privately) for the Security Team and Brand Mgmt.


Uptime Statistics
=================
Uptime for this reporting cycle hit 99.9% overall, with critical
services staying at an impressive 100.0%. The usual culprits, Jenkins
and SonarQube, were responsible for dragging down the overall score,
and are being replaced/updated to address this. The moin wiki, which has
been moved and cleaned up heavily, has improved immensely, going from a
previous average of 94% uptime to a solid 100% uptime over the past few
months. This weekend (Dec 17-18) we were hit by an outage at NERO,
causing outages for the services hosted at OSUOSL. While the situation
has been resolved, this will reflect negatively on next month's SLA.

For more details, please visit: http://status.apache.org/sla/


GitHub as Master
================
The GitBox project is moving ahead as planned, and is generally
considered ready for testing with willing projects. We are in talks with
a specific project for the initial tests, and will discuss onboarding
more projects as this progresses. The services involved have been set up
and fully puppetized, and tests are showing good results here. While
this service depends on LDAP (and thus awaits the upcoming LDAP changes
for podlings), we have modified the system to work with a hardcoded list
of podling members, so we can test with podlings without having to wait
for the LDAP changes.

We have a list of things either to be done or that have been done, at:
https://pad.apache.org/p/gitbox - this outlines what we were thinking,
where we are, and what remains to be done before we can consider this
service production ready.

We invite everyone to visit https://gitbox.apache.org/ and have a look,
perhaps even try out the account linking service and provide feedback on
this.


'Costs per project' Project
===========================
As a long-term project, Infrastructure has been tasked with reducing
the "cost per project added into our systems." There has been some
work at the margins of our self-service tools and processes, to reduce
the staff/volunteer time to provide service and to reduce resource
costs of these services (eg. locating services on more cost-effective
providers). However, we have not made any progress on computing
current per-project costs, projecting those costs, or planning
mitigation strategies.

16 Nov 2016

Additions to Infrastructure
===========================
Sebastian Bazley (sebb, apmail karma)
Freddy Barboza (fbarboza)
Chris Thistlethwaite (christ)
Christofer Dutz (cdutz)

In addition to the above karma grants, our two new staff members
(Freddy Barboza and Chris Thistlethwaite) are ramping up. Expect
forthcoming blog posts from our three recent hires on the infra blog.



Operations Action Items
=======================

Short Term Priorities
=====================
- Ensuring we have gathered enough information to begin the Gitbox
 experiments.
- Making sure the new wiki instance is running smoothly
- Introducing new staffers to the infrastructure
- Networking/F2F meetings at ApacheCon Europe
- Further work on rebranding our new web site

Long Range Priorities
=====================
- Stand up a service for mirroring repositories and events from GitHub
- Mailing list system switchover is expected to happen in the coming
 months. We are aware of a few outstanding mail-search requests that we
 have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
 decommissioning
- Deprecate eos (currently only running mail-archives, wiki was moved)
- Work on weaving podlings into LDAP
- Further explore identity management proposals


General Activity
================
- moin wiki (wiki.apache.org) was moved to a new, faster box and
 inactive accounts were pruned to greatly increase responsiveness.
- More package consolidation and updates on the Jenkins platform
- Fixed issues with the buildbot configuration scanner not working
 as intended.
- Stood up buildbot and jenkins setup for publishing web sites via git.
- Moved more VMs from vCenter to new cloud locations.
- gitbox (stage two of our github experiment) has been stood up as an
 actual machine, with some services working already. We expect to be
 able to start mirroring and gathering push logs in the coming month.
- In conjunction with gitbox, reworked the overall design of our
 writeable git repository web interface (and received positive
 feedback).
- Reworked policy for new git repositories so that new repositories are
 automatically mirrored and have github integrations enabled by
 default.
- Worked on a new web site for infrastructure. Some components can be
 used for projects wishing to switch to a git-based automated workflow.
- Added and fixed a bunch of jenkins build nodes.
- Debugged and (hopefully??) fixed some issues with our pubsub systems
- Consolidation, general sanity checks and harmonization of Jenkins
 builds.


Ticket Response and Resolution Targets
======================================
 Stats for the current reporting cycle can be found at
 https://status.apache.org/sla/jira/?cycle=2016-11
 The tentative goal of having 90% of all tickets fully resolved in time
 is still being used. We are hoping that the onboarding of new staff
 will help greatly improve these stats.


Uptime Statistics
=================
 Uptime this month was mostly affected by the moin wiki, which we have
 now moved. We expect next month to have a much better uptime than this
 month (99.79%).

 For detailed statistics, see http://status.apache.org/sla/


Github as Master
================
Sam asked us to report on this process. To date, we've finished our
discussion with Legal Affairs to ascertain the framework that things
have to operate in.

Additionally, we've started discussions on private@ about the ASF-side
of that work. You can see the documentation from that discussion:
 https://pad.apache.org/p/gitbox

We have an instance running in AWS (the "gitbox") to run the above.

Finally, we've identified an initial project to subject to the
experiment.


'Costs per project' Project
===========================
Sam has asked us to focus on ways to reduce the straight linear
expansion of costs for each additional project that we add. To date,
this remains a bit of a thought experiment on how to measure the costs
in a more granular way than we do now. Currently the data that we have
shows a correlation between project growth and increases in requests
for service, bandwidth, etc. This correlation has led to our existing
planning/staffing models, but there remains much to be done on this
front.

The goal for next month is to figure out what specific points
we need to be measuring, and ideally to look backwards to validate
what the actual rate per project is in terms of consumption.

The goal for Q1 is to break that down, and figure out what the
constraints of the current onboarding and ongoing operations are.

19 Oct 2016

Operations Action Items:
========================


Short Term Priorities:
======================
- Work on experiment with ProxMox for virtualization to aid
 in moving VMs from the vCenter cluster, slated to be decommissioned.
- Further explore upgrading puppetised machines to Ubuntu 16.04
- Engage with and onboard new staff once hired.

Long Range Priorities:
======================
- Explore moving podlings to separate LDAP entries, which would also
 make the MATT/GitHub experiment much easier in the long run.
- Mailing list system switchover is expected to happen in the coming
 months. A couple of outstanding tickets are pending, and we have yet
 to design a working redirect, although this is expected to be trivial.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
 decommissioning.
- Moving JIRA to a new location, as the old machine is slated to be
 decommissioned.

General Activity:
=================
- Infrastructure will be present at ApacheCon Seville to interact with the
 wider ASF community.
- Maven backup repository is being moved to a different DC to save cost
- LDAP clusters are being resized to accommodate DC IP shortages
- VPN scenarios being worked on to reduce the number of public IPs used

Ticket Response and Resolution Targets:
=======================================
 Stats for the current reporting cycle can be found at
 https://status.apache.org/sla/jira/?cycle=2016-10

 The tentative goal of having 90% of all tickets fully resolved in time
 is currently not being met, but we expect the rate to go up once new
 staff has been onboarded.

Uptime Statistics:
==================
 Uptime stats have been very stable over the past few months,
 at 99.9% for critical services and 99.8% in total.

 For detailed statistics, see http://status.apache.org/sla/

21 Sep 2016

A report was expected, but not received

17 Aug 2016

Short (and tardy) report this month.

Hiring
---------

We saw the blog post announcing the open position get reposted to
several remote working job boards and have had a decent response in
terms of resumes showing up. We made a first pass over approximately 1/3 of
the incoming resumes as of this writing and hope to finish that first pass by
end of week. We've seen a number of promising resumes show up and hope to be
able to move forward with them. Hopefully we'll have material updates to
provide on this in the coming week.

General activity
-------------------

Ongoing work on Jenkins and Buildbot slaves has finally coalesced into being
completely managed via puppet. This is a huge milestone as these are
relatively complex configurations and because of the number of machines it
involves. It's also gotten some critical acclaim[1]. This should make it much
easier for folks to get additional build dependencies installed in a more
self-service manner (a simple pull request against the repo, compared to filing a
ticket and having that done by infra).

Work is in preparation for migrating our Jira instance off of our existing
hardware. The current hardware is approximately 6 years old, and our instance
has grown so much that the underlying database infrastructure on separate
machines is starting to become a constraint.

Building on Blocky (infrastructure-wide IP bans for violation of rules
against one or many hosts), we've added a dashboard to manage both the rules
and the blocked IPs.

Github experiment
-----------------------

Nothing material to report. Things seem to be largely 'just working'.
We are working on adding an incubator project to the github experiment - we
expect this will push the limits of the project as there are literally
thousands of incubator committers, so it should be an interesting threshold to
pass and gauge our ability to cover them.



[1] https://s.apache.org/A9oF

20 Jul 2016

As reported last month, we began, and successfully completed, new
contract negotiations with the contractors this month.

We are still suffering from a staff shortage, and that continues to
deleteriously affect many things.

Infrastructure has seen some notable problems. An intermittent Jira outage
affected a large number of users. The folks at Atlassian assisted us in
diagnosing the problem and moving forward.

Additionally one of our VMware hosts has an ailing storage array. While work
was underway to evacuate all of the hosts, this storage malaise has
exacerbated the situation.

15 Jun 2016

Operations Action Items:
========================
Hiring
--------
We've published a job description and are working with the President
and Virtual to publish this widely in efforts to solicit more candidates.

Demand for Infra services
-----------------------------------

The number of projects that the Foundation is responsible for
continues to grow, and that is placing an ever increasing burden
on demands for infrastructure resources. Today, our largest constraint
is staff members to do the work.

Historically, we've had an average of 33 tickets per month per full
time staff member, and as that average grows we typically add staff.
Today we have 3.5 full-time staff members working, though statistically we
really should be at 5 just to be able to handle the ticket load. (This does not
include any time to focus on larger scale projects.)

Given our current rate of growth, addressing tickets alone will require 5.5 full-time
staff members by year's end, and if we continue at our current pace we'll need to add
1 to 1.5 staff members every 2 years just to deal with continued growth.

Take a look at a graph demonstrating the growth rates, demand for services
based on tickets, and staff members:
https://i.imgur.com/72V0DFN.png
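
The staffing arithmetic above can be sketched as a small calculation. This
is an illustration only: the function name and the sample monthly ticket
volume (165, which is simply 5 staff x 33 tickets) are assumptions chosen
to match the figures quoted in this report, not data taken from Jira.

```python
# Illustrative sketch only: names and the sample ticket volume are assumptions.
TICKETS_PER_STAFF_PER_MONTH = 33  # historical average quoted in the report


def required_staff(tickets_per_month: float,
                   per_staff: float = TICKETS_PER_STAFF_PER_MONTH) -> float:
    """Return the number of full-time staff needed to keep up with tickets."""
    if per_staff <= 0:
        raise ValueError("per_staff must be positive")
    return tickets_per_month / per_staff


if __name__ == "__main__":
    # 165 tickets/month is a hypothetical load equal to 5 staff x 33 tickets.
    needed = required_staff(165)
    current = 3.5  # current headcount quoted in the report
    print(f"staff needed: {needed:.1f}, shortfall: {needed - current:.1f}")
```

Feeding in the real monthly ticket counts would show how the required
headcount tracks ticket growth over time.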


Short Term Priorities:
======================
TLP
---
Some of the automation behind the mechanics of transforming a
graduated podling into a TLP fell into disrepair over the past few
months. This led to many drawn-out timelines for TLP graduation.
Infra held a TLP work day, and while graduation largely remains a manual
operation, there is now a current runbook for dealing with graduations,
and all of the pending graduations were processed. There is ongoing work
to automate large swaths of that, but for now it should only take
~30 minutes to process a newly-graduated TLP.


Long Range Priorities:
======================

Because of staffing shortage, precious little work has occurred on
our long range priorities. Instead we've moved back to what can
largely be described as firefighting and attempting to keep up
with incoming work as best as can be managed.

General Activity:
=================

Outages:
------------

We suffered a somewhat long-term outage of the Nexus repository.
We temporarily restored the service, but a service move off of the
ailing VMware infrastructure is planned for the short-term future.

More recently, we suffered both a database and VMware-related
outage on the same day. Our VMware infrastructure is on increasingly
brittle hardware. We have been making concerted efforts for some months
to move VMs off of this infrastructure, and continue to do so.

New Contracts
--------------------
Following up on discussions that occurred at ApacheCon NA, we are beginning the process
of renegotiating contracts for our non-employee staff members.

Uptime Statistics:
==================
Overall we met the service uptime, though on an
individual service basis we did not meet the uptime expectations for repository.a.o.
Please see:
http://status.apache.org/sla/

18 May 2016

Operations Action Items:
========================

Staffing
-------------

Our staff is down by roughly 36% currently. That has impacted SLAs,
particularly those around response times to tickets.
Staff members have been interviewing a potential new hire.


Short Term Priorities:
======================

Code signing
-------------------

After initial plans to discontinue this service due to high cost,
we were able to successfully negotiate a satisfactory contract
with Symantec.

Jira Spam
-------------------------

Our Jira service was again under attack this month, starting sometime on
Tuesday morning and continuing through to Thursday lunchtime. We have
had to yet again restrict the 'jira-users' group from creating and
commenting on issues. Regular contributors/committers to projects are
still able to create and comment on tickets for projects in which they
are named in 'roles'. Most committers cannot yet create INFRA tickets or
tickets for other projects in which they are not in a role.

The Infra team have banned over 60 IP addresses via fail2ban triggers.
Nearly 2000 Spam tickets were created over multiple projects. Over 160
user accounts were either deleted or disabled.

We have in place automatic ban triggers for more than one account
created using the same IP address in an hour.
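
The ban rule described above (more than one account created from the same
IP address within an hour) can be sketched as a sliding-window check. This
is a hypothetical illustration, not the fail2ban configuration Infra
actually uses; the class name, method name and sample addresses are all
assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "in an hour"
MAX_ACCOUNTS = 1       # more than one account per IP triggers a ban


class SignupBanTrigger:
    """Hypothetical sliding-window detector for repeated signups per IP."""

    def __init__(self, window=WINDOW_SECONDS, max_accounts=MAX_ACCOUNTS):
        self.window = window
        self.max_accounts = max_accounts
        self._signups = defaultdict(deque)  # ip -> recent signup timestamps
        self.banned = set()

    def record_signup(self, ip, timestamp):
        """Record an account creation; return True if the IP is banned."""
        if ip in self.banned:
            return True
        recent = self._signups[ip]
        recent.append(timestamp)
        # Drop signups that fall outside the one-hour window.
        while recent and timestamp - recent[0] >= self.window:
            recent.popleft()
        if len(recent) > self.max_accounts:
            self.banned.add(ip)
            return True
        return False


if __name__ == "__main__":
    trigger = SignupBanTrigger()
    print(trigger.record_signup("192.0.2.1", 0))     # first signup from this IP
    print(trigger.record_signup("192.0.2.1", 1800))  # second within the hour
```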

At this moment the restrictions are still in place whilst Infra works
out future options.

Modules created for crowd testing (in progress)

MATT/Github
-------------------
Following a successful (quiet) month of having both Whimsy and Traffic
Server using the git-dual system, Infrastructure is contemplating adding
more projects to the experiment, in part to test out different aspects
of the experiment that may not be fully utilized by the current
projects, and in part to increase the load on the service to see if
anything breaks when we start hitting rate limits. Infra is at present
considering adding the Beam incubator project to the test, which would
let us experiment with extremely large groups of people (universal
commit bit etc).

lists.apache.org
----------------------
At ApacheCon, we unveiled the new service https://lists.apache.org.
This service is built to replace both mail-archives.apache.org and
mail-search.apache.org. See more details below.

Long Range Priorities:
======================

Monitoring
---------------

No material updates here due to staffing issues.

Automation
-----------------

No material updates here due to staffing issues.

Technical Debt
-----------------------
We've taken the first major step in retiring mail-search and
mail-archives. The former is particularly important in our paydown of
technical debt, as it currently runs on hardware that is 8 years old and
retiring it reduces the number of operating systems that we are forced
to manage. To boot, much of the mail-search platform is undocumented and
the current Infra staff have little knowledge of how to operate those
services.


General Activity:
=================
Staffing issues have made this month a bit more mechanical than normal.


Uptime Statistics:
==================

For detailed statistics, see http://status.apache.org/sla/

20 Apr 2016

Short Term Priorities:
======================
- Ensuring/monitoring that the dual git repository setup works as
 intended.
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain (asfplayground.org). We believe this separation will
 also help projects and volunteers work on their ideas, such as the
 Syncope trial, as they can then develop and test without interfering
 with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the
 troubled BugZilla instances and possibly JIRA.


Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming
 months. We are aware of a few outstanding mail-search requests that we
 have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their
 decommission
- Further explore identity management proposals


General Activity:
=================
- git-dual has been fully puppetized and is ready for more extensive
 testing.
- Traffic Server has joined the git-dual experiment, so far without
 incident.
- Due to severe abuse, we had to take our European archive server
 offline and make extensive restrictions on our US archive. We also had
 to put the US archive in maintenance mode for a few days in order to
 relocate it to a new machine that can better handle the load (and
 isn't 5 years old).
- Following concerns raised on infrastructure@ about the loss of history
 due to the migration of people.a.o to a new host not including the
 contents of committers' public_html directories, an infrastructure
 volunteer stepped forward to automate the copying of this data. Of the
 original ~500GB, around 80% was inappropriate (RCs, Maven repos,
 nightly builds) and was filtered out. There have been no further
 concerns raised since the copy of the remaining ~110GB was completed.

Ticket Response and Resolution Targets:
=======================================
 Stats for the current reporting cycle can be found at
 https://status.apache.org/sla/jira/?cycle=2016-04
 The tentative goal of having 90% of all tickets fully resolved in time
 is still being used. Compared to March, we have had slightly more
 tickets opened, and the percentage of tickets that hit our SLA is
 lower, as we are essentially 2 full time staffers fewer than we were in
 March.

 Quick April reporting cycle summary:
   - 230 tickets opened
   - 228 tickets resolved
   - 248 tickets applied towards our SLA
   - Average time till first response: 13 hours (up from 6h in Feb-Mar)
   - Average time till resolution: 44 hours (up from 23h in Feb-Mar)
   - Tickets fully handled in time: 90% (204/227)
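
As a quick check of the last figure, the percentage can be recomputed from
the raw counts. This is only a sketch; the function name is an assumption,
and the inputs are the counts quoted in the summary above.

```python
def handled_in_time_pct(in_time: int, total: int) -> float:
    """Return the share of tickets handled in time, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * in_time / total


if __name__ == "__main__":
    pct = handled_in_time_pct(204, 227)
    # Print both the unrounded value and the whole-number figure reported.
    print(f"{pct:.2f}% unrounded, reported as {round(pct)}%")
```

The unrounded value sits just under the 90% goal and reaches 90% only
after rounding to a whole number.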


Uptime Statistics:
==================
 Uptime is not pretty this month, but this is primarily an aesthetic
 issue. The biggest failure here has been keeping LDAP servers in sync,
 as we have a specific LDAP server in PNAP that keeps going 10 minutes
 out of sync with the rest. We are looking into why this happens, but
 so far we have not been able to determine the true cause. Nonetheless,
 this does not mean LDAP has been down or unresponsive, merely out of
 sync on one node.

 As mentioned earlier, we also had to pull the EU archive due to
 massive abuse, primarily from some EC2 instances and other external
 datacenters. We are working with the data centers in question to
 resolve the issue, and have also imposed a new 5GB daily download
 limit per IP on the new archive machine. Preliminary data suggests
 this new limit has reduced the traffic from archive.apache.org by as
 much as 65% (going from around 3TB/day to 1.1TB/day), further
 suggesting that the bulk of traffic from that service is to poorly
 configured VMs or other CI systems that should never have been using
 the archive in the first place. As the machine that hosted the archive
 also hosts the moin wiki and mail archives, we believe that these
 services will perform better now that archive.a.o has moved.
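
 A per-IP daily download limit of the kind described above can be sketched
 as a simple byte counter keyed by address and day. This is a hypothetical
 illustration, not the mechanism deployed on the archive machine; the class
 name, the method name and the reading of "5GB" as 5 * 1024**3 bytes are
 all assumptions.

```python
from collections import defaultdict

DAILY_LIMIT_BYTES = 5 * 1024**3  # "5GB"; the binary interpretation is assumed


class DailyQuota:
    """Hypothetical per-IP daily download quota."""

    def __init__(self, limit=DAILY_LIMIT_BYTES):
        self.limit = limit
        self._used = defaultdict(int)  # (ip, day) -> bytes served so far

    def allow(self, ip, day, size):
        """Return True and count the download if it fits in today's quota."""
        key = (ip, day)
        if self._used[key] + size > self.limit:
            return False
        self._used[key] += size
        return True


if __name__ == "__main__":
    quota = DailyQuota()
    print(quota.allow("192.0.2.1", "2016-04-20", 4 * 1024**3))  # within quota
    print(quota.allow("192.0.2.1", "2016-04-20", 2 * 1024**3))  # would exceed
```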

 For detailed statistics, see http://status.apache.org/sla/

16 Mar 2016

New Karma:
==========


Finances:
==========


Operations Action Items:
========================


Short Term Priorities:
======================
- Ensuring/monitoring that the dual git repository setup works as intended.
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain (asfplayground.org). We believe this separation will also help
 projects and volunteers work on their ideas, such as the Syncope trial, as
 they can then develop and test without interfering with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the troubled
 BugZilla instances and possibly JIRA.


Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming months. We
 are aware of a few outstanding mail-search requests that we have concluded
 could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals
- Further explore the MATT experiment (see General Activity)

General Activity:
=================
- Finished the initial design of git-dual.apache.org, intended for distributed
 git repositories (Whimsy experiment). Things have been fully in sync for now.
 Some discoveries and gotchas were made in setting this up, such as split-brain
 issues and canonical source requirements for the setup to sync properly - in
 particular, it seems that the 'origin' setting in each repository must be set
 to GitHub for it to sync properly both ways. There is still the issue of a
 discrepancy between emails for commits pushed to ASF and commits pushed to
 GitHub, but this is being worked on.
- Adding new members to our GitHub organisation has been fully automated and
 tied to LDAP. Any committer setting their githubUsername field through
 id.apache.org will automatically be added as a member of our organisation
 there. We are pleased to say this is/was the final step in fully automating
 the MATT experiment from the committers' side of things. All adding/removing
 of members for github repos is now automated completely, if delayed by a few
 hours due to rate limits (in anticipation of 1000+ repositories, we have
 decided to do slow updates).
- Added JIRA SLA guidelines (as well as a status page on status.apache.org, see
 below)
- We had a bad disk on coeus, our central database server, causing many services
 to be slow while we replaced the disk and resilvered the mirror over the
 weekend.


Ticket Response and Resolution Targets:
=======================================
 We have added a new SLA for tickets. We are still tweaking the parameters in
 this SLA, but the preliminary ones have been applied to our new JIRA SLA page,
 where stats for the current reporting cycle can be found at
 https://status.apache.org/sla/jira/?cycle=2016-03
 Tickets created on or after February 23rd are counted towards our new SLA. We
 have a tentative goal of having at least 90% of all tickets fully handled
 (responded to and resolved) in time.

 Quick March reporting cycle summary:
   - 178 tickets opened
   - 216 tickets resolved
   - 144 tickets applied towards our SLA (Feb 23rd onwards)
   - Average time till first response: 7 hours
   - Average time till resolution: 17 hours
   - Tickets fully handled in time: 95% (118/124)


Uptime Statistics:
==================
 Nothing out of the ordinary to report here (99.9% across the board). We
 decommissioned the use of minotaur as a web space provider and have moved
 people.apache.org to home.apache.org. This has caused a slight rise in uptime
 for standard services as the old people.apache.org was frequently experiencing
 issues.

 BugZilla has seen less abuse than before, possibly because of the measures
 taken previously. We are however exploring utilizing Apache Traffic Server to
 further ensure its future stability.

 Once again, the Moin wiki is, as always, experiencing hiccups, caused by the
 general flaw in its design. We urge any project still using this to switch to
 cwiki, if they experience slowness in the service.

 For detailed statistics, see http://status.apache.org/sla/

17 Feb 2016

New Karma:
==========
N/A

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
- Setting up a new git repository server at the ASF for testing git r/w
 synchronization viability.
- Provision an isolated test environment (including isolated LDAP) for
 developing/testing services faster than is currently possible, under a
 separate domain (asfplayground.org). We believe this separation will also help
 projects and volunteers work on their ideas, such as the Syncope trial, as
 they can then develop and test without interfering with production systems.
- Agreement with Apple has been executed, and projects may now publish
 apps in the iTunes store. Huge thanks to Mark Thomas, who has been
 working on this for multiple years and seen it to conclusion.
- Appveyor CI is now available for projects that make use of a GitHub
 mirror. For more details, see:
 https://blogs.apache.org/infra/entry/appveyor_ci_now_available_for

Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming months. We
 are aware of a few outstanding mail-search requests that we have concluded
 could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals

General Activity:
=================
- Moved the STeVe VM to a new data center in anticipation of the upcoming annual
 members meeting. Benchmarks show it could serve an election with more than
 6,000 voters, so it should be adequate for the upcoming election.
- The new machine for serving up ASF's end of the ASF->GitHub r/w repositories
 is under work, but has been slowed a bit by some difficulty in integrating
 new hooks as well as difficulty with our current environment setup (see SRPs
 above).  As stated in the SRPs, we have discussed deploying a separate test
 environment that is isolated from our production environment and would allow
 for faster development/testing. The project hinges on three phases: ACL,
 email/webhooks and dual r/w repositories with sync. The ACL has been
 automated now and is working. The email phase has turned up some faults in
 GitHub's way of determining whether a commit is unique (and thus deserves a
 diff email), and thus we are working towards replacing this with a new
 mechanism. This new method however is dependent on getting the 2xr/w repos
 up and running, which in turn depends on rewriting the git hooks we have in
 place for our current setup, and is the cause of the current slow progress.
 We expect to have a basic work flow with new hooks in the coming week, and
 once that is confirmed, we'll attach the Whimsy code to that process.
- Work continues on moving machines from our old VM boxes to our new provider.
- Fixed some issues with buildbot configurations not being applied.
- Lots of JIRA activity: 175 tickets created, 146 resolved
- Fixed spam filtering cluster so that it runs checks correctly and is able to
 catch more legitimate spam. This has already proven very effective.
- 5 machines and two arrays decommissioned and removed from OSUOSL racks.

Uptime Statistics:
==================
The new Whimsy VM has implemented a status page that we have tied into our
monitoring system (PMB). This is still not considered a production environment,
and thus does not count towards our SLA.

This month was a slightly bumpy ride for some services. While the critical
services continued their nice trend of pretty much 100% uptime, some core
services such as people.apache.org have been very unstable, indicating that we
may need to speed up the decommissioning of this service.

BugZilla has been abused by some external services trying to scrape
everything, causing some downtime. We have implemented a connection rate
limiting module (using mod_lua) for the BZ proxy and this seems to have helped,
albeit not stopped the downtime completely.

The Moin wiki is, as always, experiencing hiccups, caused by the general flaw
in its design. We urge any project still using this to switch to cwiki, if they
experience slowness in the service.

For detailed statistics, see http://status.apache.org/sla/

20 Jan 2016

Things were quieter this month than normal, largely due to holidays.

General Activity:
=================
- Added new nodes to our LDAP cluster for more robustness, starting work on
 adding LDAP load balancers to prevent abusing specific nodes.
- Work continues on moving machines from our old VM boxes to our new provider,
 as well as formalizing existing setups (config mgmt etc).
- GitHub organisational changes were applied, in order to continue with the
 exploratory MATT project. Committers can now be automatically added and/or
 removed from GitHub teams depending on LDAP affiliation and MFA settings.
- 120 JIRA tickets created, 156 resolved

Uptime Statistics:
==================
Continuing the positive trend since November, we are very pleased to report that
uptime across all SLA segments for this reporting cycle was above the famous
'three nines', and the overall uptime - when using the same number of decimal
places as most service providers - was at a marvelous 100.0%. That is not to say
we did not have our share of service glitches - in fact we had quite a few - but
the duration of these (a few minutes each) was too insignificant to budge the
total uptime figure.

In order to compare better with places using various decimal place settings, we
have tweaked our SLA page to accept this setting as an argument, thus:
http://status.apache.org/sla/?1 will show uptime as XX.Y% whereas
http://status.apache.org/sla/?3 will show it as XX.YYY% etc.
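
The effect of the decimal-place setting can be sketched as follows. This
is only an illustration of the rounding behaviour, not the code behind the
SLA page; the function name and the sample uptime value of 99.96 are
assumptions.

```python
def format_uptime(uptime_pct: float, decimals: int) -> str:
    """Format an uptime percentage with the given number of decimal places."""
    if decimals < 0:
        raise ValueError("decimals must be non-negative")
    return f"{uptime_pct:.{decimals}f}%"


if __name__ == "__main__":
    sample = 99.96  # hypothetical uptime, not a measured figure
    for places in (1, 3):
        print(places, format_uptime(sample, places))
```

With one decimal place a value such as 99.96 rounds up to 100.0%, whereas
three decimal places keep the shortfall visible as 99.960%.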

There are still places that we cannot or do not yet monitor, more specifically
various components in Whimsy (mostly due to access restrictions), and we are
in talks with people from the Whimsy project about utilizing a new status page
for our monitoring.

Whimsy Github Experiment
========================
We've done some work around MATT (Merge All The Things -
http://matt.apache.org) and are relatively happy with the workflow around
definitively identifying Github and Apache accounts and merging them.
(Our process allows for folks to sign into their Github account and the Apache
account and confirm the other.)

That work has been followed up with some automation work so that
we can create groups on the fly, automagically populate them from
group membership in LDAP (predicated upon the person identifying their
Github ID with MATT), and then grant them commit access if they have MFA
(Multi-factor authentication) enabled. We've decided, at least for the time
being to mandate MFA - as we have no visibility into authn failures or other
auth attacks, and the MFA ability grants us better security than we have on
our own hosted repositories. (Currently ~1/2 of Whimsy committers do not
have commit access to the repository because they have not enabled MFA.)
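
The membership rule described above can be sketched as a set computation:
a committer ends up in the GitHub team only if they are in the LDAP group,
have linked a GitHub ID, and have MFA enabled. This is a hypothetical
illustration, not the MATT code; the function name, parameter names and
sample accounts are all assumptions.

```python
def plan_team_changes(ldap_members, github_ids, mfa_enabled, current_team):
    """Return (to_add, to_remove) sets of GitHub logins for one team.

    ldap_members: set of ASF ids in the LDAP group
    github_ids:   dict mapping ASF id -> linked GitHub login
    mfa_enabled:  set of GitHub logins with MFA turned on
    current_team: set of GitHub logins currently in the team
    """
    eligible = {
        github_ids[uid]
        for uid in ldap_members
        if uid in github_ids and github_ids[uid] in mfa_enabled
    }
    return eligible - current_team, current_team - eligible


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    to_add, to_remove = plan_team_changes(
        ldap_members={"alice", "bob", "carol"},
        github_ids={"alice": "alice-gh", "bob": "bob-gh"},
        mfa_enabled={"alice-gh"},
        current_team={"dave-gh"},
    )
    print(sorted(to_add), sorted(to_remove))
```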

We still have work to do on the Github experiment, mainly around the
automation of pushes back to an ASF copy of the repositories.
We should have more to report on this front next month.

16 Dec 2015

New Karma:
==========
None this month

Finances:
==========
N/A

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
- The M.A.T.T (Merge All the Things) project for Whimsy is progressing, albeit
 hindered a bit by the need to change our entire setup on GitHub. We expect
 this to be solved within the next reporting period.
- We discovered an error in one of the monitoring systems we use (a faulty
 configuration) which had been preventing it from notifying us of some issues
 such as a bad disk. This has been rectified and hardware replacement should
 happen before the board meeting.
- We are in the middle of moving a lot of VMs away from our old vcenter setup to
 the new infra provider, LeaseWeb, as well as upgrading them, which will clear
 out a lot of technical debt.

Long Range Priorities:
======================
The mailing system switchover is scheduled to happen within the coming months,
with tests being performed at the moment to port lists from the old ezmlm system
to the new mailman 3 system. We are preparing to stress-/scale-test the new
setup.

We are looking into replacing our current backup plans with more affordable ones
while still retaining a hardware-managed solution. This will hopefully cut our
backup costs in half in the long run.

General Activity:
=================
* Lack of responsiveness
 We've noted that our responsiveness to email on infrastructure@ has suffered
 of late, resulting in dropped work items and failing to meet expectations.
 While we certainly don't want to manage work via email (far too much of it)
 it's clear that a number of issues were being lost in the sea of email. We've
 reinforced that the responsibility of ensuring we are responsive to emails
 falls to the on-call staff member of the week, and hope to see this situation
 improve.

* people.apache.org moving to home.apache.org
 There have been some discussions going on about the decision to decommission
 the current people.apache.org server and move to a new machine. Most
 discussions have revolved around specific methods of access (rsync, scp, sftp)
 and some people have asked why the old data was not copied over verbatim. We
 are looking into whether rsync and scp can become a reality (so far, our
 search has shown it's not an easy task to couple those with LDAP). We are not
 actively looking into copying everything from minotaur (the old host), as it
 would fill up most of the disk space with unnecessary files (very low signal
 to noise ratio).

* code signing service
 Discussion has begun about discontinuing the code signing service. Despite a large
 number of projects requesting the service, to date, only two projects are making use
 of the service. Moreover, in the past year, we've had a grand total of 34 signing events.
 The conversation on whether offering a code signing service remained pragmatic with
 such low uptake began on infrastructure@ - Symantec was notified that we plan to
 discontinue the service, and has asked to be allowed to submit a less expensive option
 for our consideration.

Uptime Statistics:
==================
The November-December period has been extremely smooth sailing with uptime
across the board reaching the fabled 'three nines' (>=99.90%) and uptime for
critical services nearly hitting 100%. For more in-depth detail, please see
https://status.apache.org/sla/
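
To put the 'three nines' threshold in concrete terms, an uptime percentage
can be converted into a downtime budget. This is a sketch under the
assumption of a 30-day reporting period; the function name is hypothetical
and the actual reporting cycle may differ in length.

```python
def allowed_downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Return the downtime, in minutes, implied by an uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_pct) / 100


if __name__ == "__main__":
    # 99.90% over an assumed 30-day period
    print(f"{allowed_downtime_minutes(99.9):.1f} minutes")
```

Under that assumption, 99.90% uptime leaves roughly 43 minutes of downtime
per period.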

18 Nov 2015

Operations Action Items:
========================
Infrastructure discovered that many projects were using git in an
innovative manner. The majority of these uses bypassed the
normal expectations that we had set about protected branches and
tags. As a temporary measure, to get us back to the same level of
assurance as expected, we disabled the ability to delete branches
and tags. This has caused a bit of murmuring as it is disruptive to the
way that many projects use git. Infrastructure awaits guidance
about policy around VCS, history, etc.

We've added a new cloud provider, Leaseweb, that should provide
some additional capacity in the US and Europe.



Short Term Priorities:
======================
We've had an increased focus on tickets in the past month.
Tickets are being created faster than we can deal with them
when viewed from a monthly perspective. Hopefully as we are
adding additional capacity we'll address this and get it back
under control.


Long Range Priorities:
======================
Automation
----------
Automation hasn't been at the top of the list this month.
Nonetheless we've made some gains in this arena. We've added
a good deal of the generic build slave and buildbot master
configuration to puppet.
We've also puppetized the new home.a.o service.

Resilience
----------
The past few months have given us a good opportunity to prove out our backups
and ability to respawn infrastructure. We've gone through multiple moves of
critical systems, most notably SVN, proving that our backups work as intended
and that configuration management works well for those hosts.


Technical Debt
--------------
Web space for committers (currently hosted on people.apache.org) is being
moved to a new home, aptly named home.apache.org, in the continued effort to
phase out the minotaur server. Committers will be given 3 months to move their
contents, after which people.apache.org will stop serving personal content and
redirect to home.apache.org. We have opted for asking people to individually
move their content due to the sheer amount of potentially unwanted old files
that currently reside on minotaur (taking up 2TB of space). This will be a web
hosting server only; shell access will not be allowed. We plan to set up a VM
for PMC Chairs later on, for performing LDAP operations, but the free-for-all
approach that minotaur has is unlikely to be offered in the future, due to the
unmaintainability of it.  Once this is in place, the only other major
component we need to remove from minotaur is our DNS service, and we can then
retire the aging machine.  We will be publicizing the above very broadly in
the coming weeks as we start the countdown clock.

Monitoring / Logging
--------------------
We have had to retire our initial unified logging cluster due to the
incredibly high amount of logs coming in (estimated 20 billion entries per
year), which was choking on disk read/write speeds, first and foremost. A new
5-node cluster with faster SSD disks has been put in as a replacement,
requiring less than 24 hours to set up and put into production thanks to our
configuration management systems and some snappy work done by the team. This
new cluster is henceforth known as 'Snappy'. We will most likely be cutting
down retention time to 3-4 months as a precautionary measure, so as to only
store somewhere around 5-6 billion records at any given time. As we mainly
need "current" logs (within the last month) for our work, this is an
acceptable compromise for us, considering the alternative would be more than
doubling the cost of the cluster.
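
The retention figures above follow from simple arithmetic. As a rough check,
assuming log volume is spread evenly across the year (an assumption made
here for illustration; the function name is hypothetical):

```python
ENTRIES_PER_YEAR = 20_000_000_000  # estimate quoted in the report

def records_retained(months, entries_per_year=ENTRIES_PER_YEAR):
    """Approximate records held for a retention window of `months`,
    assuming log volume is uniform over the year."""
    return entries_per_year * months / 12

for months in (3, 4):
    print(months, "months:", round(records_retained(months) / 1e9, 1), "billion")
```

This gives about 5 billion records for 3 months and a little under 7 billion
for 4 months, in line with the estimate of somewhere around 5-6 billion.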

General Activity:
=================
On-boarding the newest member of the team is ongoing, and has proven to be a
very smooth task, with every member of staff pitching in to provide guidance
and help.

The mail archive PoC has been on a hiatus, with the two lead designers being
away on travel at various times, but as the systems have been running by
themselves in the background, they have provided valuable debugging information
for optimizing and further developing them. We remain convinced that we are on
the right track in terms of software to use for the next generation of mailing
lists, as we have not seen sufficient evidence of other alternatives operating
at the same scale the ASF does.

Uptime Statistics:
==================
See http://status.apache.org/sla/ for details.

The critical services SLA was not met this month due to an LDAP outage that
caused our "LDAP Sync" checks to fail for an extended period of time. While the
LDAP service was not itself unavailable, we will be looking into better ways to
ensure that we act faster when nodes go out of sync or stop functioning. The
overall SLA was however met. In practice this had almost no effect on users, but
didn't meet our own internal standard.

21 Oct 2015

Date:
By:
Reviewed by:


New Karma:
==========
Infrastructure has added Daniel Takamori as a part-time staff member.
You can read more here: https://blogs.apache.org/infra/entry/dear_apache
This is a backfill of the part-time contract position that expired in
May of this year.

Finances:
==========
I spent some time talking with the Virtual folks about our rate of
expenditure in some specific GL Accounts - those used for replacing
hardware, cloud infrastructure, and build farm costs. While we are
currently close to plan in terms of totals, the rate of spending has
increased dramatically, and failing either a reduction in spending,
or in-kind sponsorships (one of which we are working on), we will
be over budget in those specific categories. That said, we are
significantly under budget overall.

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
N/A

Long Range Priorities:
======================
Automation
----------
We've made small, regular strides in increasing our automation efforts,
 but nothing particularly report-worthy.

Resilience
----------
Much work around backups, improving both their scope and their quality,
is ongoing. We are backing things up, but need to figure out
a retention policy, and that work is yet to be done.

Technical Debt
--------------
We've made significant strides in migrating workloads off of aging machines.
While we have a long way to go, progress is occurring. Most notably by end
of month we should have everything in place to retire the last remaining
machine at our Florida Colocation facility and cancel that contract.

Monitoring
----------
After last month's significant gains, we are seeing benefits brought on
by monitoring, but not necessarily additional gains in the monitoring.

General Activity:
=================
ApacheCon Europe happened in the past month, and a number of Infra
volunteers and staff members attended.

We suffered a second LDAP outage that required editing corrupted
data and restoring it. This caused a brief outage.

Mail system proof of concept work continues, though the largest
blocker at the moment is ensuring that a broader-than-the-ASF
community cares about Ponymail.

Uptime Statistics:
==================
See http://status.apache.org/sla/ for details.

Despite some needed LDAP maintenance that caused a brief outage, overall
we met or exceeded all of the SLAs this month.

16 Sep 2015

Operations Action Items:
========================


Short Term Priorities:
======================
DDOS
-----
We experienced what we believe to be a DDoS attack on the mirror
redirection CGI script this month. This drove our 15-minute load
average to 2700+. We resolved the issue by deploying a much more
efficient redirection script and redirecting all queries for the old
script to the new one. It took us around 12 hours to mitigate the
attack.


LDAP
----
During the reporting period we lost an LDAP server inadvertently, and
this caused a number of services to cease being useful. However, our
alerting detected the problem in a timely manner, and thanks to our
resilient architecture and configuration management, we were able to
provision a new host and have it working again within 12 minutes. A
12 minute Mean-Time-to-Recovery is a stunning statistic.



Long Range Priorities:
======================
Automation
----------
See the monitoring section for details on how we've automated the
blocking of abusive traffic.

Resilience
-----------
We didn't add much in the way of resilience, but we have a great
example of how our resilience allowed us to quickly recover from
a failure. See the short term section above.

Technical Debt
---------------
See details of mail in the General Activity section

Monitoring
-----------
This month a lot of investment in monitoring over the past quarter has come
to fruition.

First - we have finally promoted centralized logging into production. This has
given us tremendous additional insight thanks to the visualization and query
tools that are now available.

We did run into some scaling issues with the 'preferred logging tool' and were
able to move to a very simple Python-based log ingester that works on both our
new puppet-managed machines and our legacy machines.

Once we had data in place, and the ability to run analysis on the fly, we
immediately saw a number of situations where our services were being abused.
Eventually, we determined that we could programmatically deal with a number of
these issues, across all of our machines. To that end, we've now deployed a
tool called blocky that, based on input from our logging system, automatically
blocks IP addresses across our entire infrastructure. We have a catalog of how
blocking this abusive behavior has dramatically reduced bandwidth usage; in
one case, 30% of a server's total bandwidth was caused by abuse.
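
The general idea of log-driven blocking can be sketched as follows. This is
not the blocky implementation, whose internals are not described in this
report; the log layout, threshold, and function name are hypothetical and
chosen purely for illustration.

```python
from collections import Counter

def find_abusive_ips(log_lines, threshold=100):
    """Return the set of client IPs with more than `threshold` requests.

    Assumes (for illustration only) that each log line begins with the
    client IP followed by a space.
    """
    hits = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return {ip for ip, count in hits.items() if count > threshold}

# Tiny synthetic sample using documentation-reserved addresses
sample = ['192.0.2.1 - - "GET / HTTP/1.1" 200'] * 5
sample += ['198.51.100.7 - - "GET / HTTP/1.1" 200']
print(find_abusive_ips(sample, threshold=3))
```

A real deployment would then push the resulting addresses to each host's
firewall; that step is omitted here.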

In addition to the technical benefits, we can also give projects and
fundraising insight into how much traffic is visiting our web properties, or
even a specific project's site, where that traffic is coming from, and what
visitors are doing most often.

General Activity:
=================

Mail Phase 2
-------------

Mail has been interesting. We went from a very promising POC to realizing that
at least one component would likely not be able to scale to match our
historical load, much less scale to future load. In general, while we don't
currently see any blockers, we are hearing of troubling experiences from
others.

In response to that, we've developed a prototype of a replacement called
Ponymail.  It can certainly handle the load. Our plan is to move this software
to the ASF, and ensure that it can develop a community around it. We've
already called attention to it with some similar organizations who are going
through mailman3 POCs.  We will not adopt this software unless a community of
folks other than ASF Infra cares about it and helps to develop it. We
want to make sure we aren't replacing an aging system with additional
technical debt that will come back to haunt us in 5-7 years.


Uptime Statistics:
==================

We met this month's service level expectations:
http://status.apache.org/sla/

19 Aug 2015

New Karma:
==========
None


Finances:
==========
LastPass: $168
Dinner meeting with Lance from OSUOSL: $65.50
Cloud Services: $4619

(The cloud services cost was somewhat inflated because we purchased
reserved instances; roughly $1k was due to this. However, it reduces our
long-term payout considerably.)


Operations Action Items:
========================


Short Term Priorities:
======================
- Sort out mail archives not updating for certain lists
- Figure out and solve emails supposedly being denied from certain senders


Long Range Priorities:
======================
- Mailman/HyperKitty proof-of-concept, implement ASF account design into it
- Revisit unified logging with a new machine setup
- Fully deprecate the remaining few large non-config-mgmt boxes


General Activity:
=================

Mail archives
-------------
An issue has arisen where certain mailing list archives have not been
updating with emails from August. The issue has been narrowed down to the
mod_mbox list database not updating, despite the raw data being present on
the archive servers. While the investigation is not complete, we have
uncovered some permission and shell environment issues that are related to
it, and expect to be able to solve this issue before the next report.

infra@ ML split
---------------
It has been suggested, and agreed upon, that the infrastructure mailing list
be split into a development and a commit part, so as to not burden people,
who are only interested in the former part, with the latter. This split is
expected to be implemented in the coming weeks.

infra@apache.org                   -> dev@infra.apache.org
infra-dev@apache.org               -> dev@infra.apache.org
infra-cvs@apache.org               -> commits@infra.apache.org
infrastructure-private@apache.org  -> private@infra.apache.org
root@apache.org                    -> root@infra.apache.org

The root@ address is not a list (but an alias) and will remain so.

We intend to keep the old addresses working, but forward them to the new
list addresses. Also, as infra@ is currently privately archived, we will not
make the existing archives public. Going forward, however, the new archives
will of course be made public.
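
The forwarding described above amounts to a simple lookup from old address to
new. The sketch below only illustrates that mapping; the actual forwarding
would live in the mail server configuration, and the function name is
hypothetical.

```python
# Old address -> new address, as listed above
FORWARDS = {
    "infra@apache.org": "dev@infra.apache.org",
    "infra-dev@apache.org": "dev@infra.apache.org",
    "infra-cvs@apache.org": "commits@infra.apache.org",
    "infrastructure-private@apache.org": "private@infra.apache.org",
    "root@apache.org": "root@infra.apache.org",
}

def resolve(address):
    """Return the new address for an old one; other addresses pass through."""
    return FORWARDS.get(address.lower(), address)

print(resolve("infra-cvs@apache.org"))
```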


Mailman3/Hyperkitty update
--------------------------
Work has been progressing, and the most recent activity has been a discussion
on how we will implement authentication and maintain the concept of
/private-arch/ (where foundation members are entitled to interrogate any
private mailing list). This feature is fairly unique to the ASF and as such
no mailing list provider has this capability natively.

The PoC has been stood up, and is accepting mails for a couple of test domains.
As soon as the platform is ready for people to look at and test,
further information will be shared.

Monitoring
-----------
We're now using Datadog as a SaaS monitoring service. It's done a decent job
of getting us a baseline of metrics. This has given us an increased
level of visibility, at least into the systems that are managed by
configuration management.

Uptime Statistics:
==================
Going forward, uptime statistics, as they relate to our SLAs, have now been
fully automated and can be found at: http://status.apache.org/sla/

While primarily done to save us from using 3-4 days of contractor time on
these statistics every year, we also felt that there were no compelling
reasons to not have this publicly available ahead of report time, as both
our SLAs and the uptime data itself have (technically) been publicly available
for a long time in their raw form.

To sum up quickly, all service level goals have been reached for the past
5 reporting cycles:

---------------------------------------------------------------------------
Cycle:     | Critical (>=99.5%) | Core (>=99%) | Standard (>=95%) | Average
---------------------------------------------------------------------------
Mar - Apr  |   99.52%           |  99.81%      |   99.14%         |  99.58%
Apr - May  |   99.83%           |  99.96%      |   99.22%         |  99.75%
May - Jun  |   99.72%           |  99.67%      |   99.32%         |  99.59%
Jun - Jul  |   99.50%           |  99.13%      |   99.65%         |  99.34%
Jul - Aug  |   99.50%           |  99.87%      |   99.80%         |  99.77%
---------------------------------------------------------------------------
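
The claim that every goal was reached can be checked mechanically against the
table. The sketch below copies the per-tier figures and targets from the
table above; the function and variable names are hypothetical.

```python
TARGETS = {"Critical": 99.5, "Core": 99.0, "Standard": 95.0}

# Uptime percentages copied from the table above
CYCLES = {
    "Mar - Apr": {"Critical": 99.52, "Core": 99.81, "Standard": 99.14},
    "Apr - May": {"Critical": 99.83, "Core": 99.96, "Standard": 99.22},
    "May - Jun": {"Critical": 99.72, "Core": 99.67, "Standard": 99.32},
    "Jun - Jul": {"Critical": 99.50, "Core": 99.13, "Standard": 99.65},
    "Jul - Aug": {"Critical": 99.50, "Core": 99.87, "Standard": 99.80},
}

def unmet_targets(cycles, targets):
    """Return (cycle, tier) pairs where uptime fell below the tier's target."""
    return [
        (cycle, tier)
        for cycle, figures in cycles.items()
        for tier, uptime in figures.items()
        if uptime < targets[tier]
    ]

# An empty list means every goal was met in every cycle
print(unmet_targets(CYCLES, TARGETS))
```

Note that the critical tier sat exactly on its 99.5% target in the last two
cycles, which still counts as met under a >= comparison.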

Contractor Details:
===================

15 Jul 2015

New Karma:
==========
None

Finances:
==========
$2930 - Rackspace cloud
$60 - Dotster
$66 - Hetzner
$3336 - AWS

Operations Action Items:
========================
N/A


Long Range Priorities:
======================

Automation
-----------

While the general spread of configuration management continues,
the major increase has been with adding Confluence as a puppet
managed service. The Buildbot master and Jenkins master have
started the move to being completely managed by configuration management.

Resilience
----------

This month we've reduced the number of disparate SSL proxies. We
now have three identical SSL endpoints, all capable of serving
the same content. With some upcoming work around GSLB, this should prepare us
to have an additional level of resilience for the many
services behind the proxies.

Monitoring
----------

We didn't make much progress on this front this month.

Technical Debt
--------------

Our GeoDNS instance failed this month. This service routes
users to the closest web server, SVN deployment, or rsync host based on where
they are in the world.  The underlying host went offline and we were
unable to resurrect it. We are awaiting assistance from smart hands
in Europe to do so. In the meantime we've disabled the service, which is
shunting all users to a single instance of these services. We'll be
moving this service to an external provider with substantial
redundancy in the future.

General Activity:
=================

Bugbash
-------

Infrastructure has run two bugbashes in the past month. Over the
course of both days, 39 bugs were resolved.

Buildbot Security Issue
-----------------------

As noted in the blog, we were alerted to some aberrant
network traffic originating from our buildbot master.
While we were able to fix the underlying issue, we
also realized that an abundance of caution dictated that we should rebuild the
machine. We also were approaching the EOL of the hardware. This led to the
decision to take the pain and rebuild it. As of this writing the post-mortem
on the incident hasn't occurred, but findings will be reported when it
happens.


Jenkins
-------

Our Jenkins master has been suffering from disk I/O issues and we
opted to change the underlying file system. While we
were down for that operation we took the opportunity to begin the process of
puppetizing the host. The restoration of data
to the host took much longer than planned, but the final
outcome appears to be performing much better.

Uptime Statistics:
==================

Contractor Details:
===================

Gavin McDonald

- Oncall Duties: remove some snapshots on hades to free up some disk space.
We are only keeping around 1 month's worth of snapshots on all repositories now
due to the space issue.
- Upgrade Confluence Wiki to latest version - documentation created in cwiki
- Migrate Buildbot to a new buildbot-vm - documentation created in cwiki
- Begin work on and local testing of buildbot master module
- 66 Jira tickets worked on, closing 53 - some longstanding CMS related tickets
resolved.

17 Jun 2015

New Karma:
==========
n/a

Finances:
==========
$3150 - cloud related expenses
$17.49 - domain renewals

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

SSL
---
The issues reported last month seem to have been dealt with
earlier in the month.

SVN
---
Disk usage on our SVN host became an issue. We attempted a
service migration to a cloud-based host with more storage.
We migrated the service, then discovered problems that
weren't found in testing and rolled back.

Long Range Priorities:
======================

Automation
----------
Moving to RTC for our configuration management code has been
an interesting exercise. It certainly hasn't found all of our
issues, but it is forcing cognizance of what and how folks are getting
things done. In addition, we are seeing a number of issues fixed
during review, before they get pressed into service.

Resilience
----------
We made little progress on improving our resiliency this month.

Technical Debt
--------------
We made little progress on reducing our technical debt.

Monitoring
----------
This month has seen a sharp rise in false positives in monitoring.
This naturally adds to the workload of the oncall person, and is
frustrating. Work is ongoing to reduce the number of false positives
that wake folks up.


General Activity:
=================
As of June 1, Chris Lambertus and Geoff Corey are now
employees within the Virtual PEO.


Uptime Statistics:
==================

20 May 2015

New Karma:
==========
n/a

Finances:
==========
$3013 - related to ApacheCon
$422 - hardware replacement
$2350 - Cloud expenses
$60 - Registrar fees

Operations Action Items:
========================
This month has seen work around onboarding the two US-based
contractors as employees. This involves background checks,
reference checks, as well as paperwork. That looks to be
largely complete. It is likely that the two US-based contractors
will be employees effective June 1.

Short Term Priorities:
======================

Websites
----------
Git-based websites are now possible, and seem relatively popular.
8 projects are now using the gitwcsub services, and we are
receiving ~2 requests per week.

SSL
----
SHA-1 based chains have been deprecated. Chrome and Windows now
show alerts for SHA-1 based certs, or certs with SHA-1 certs in the
chain.  Due to this we spent a good amount of time swapping
out SSL certs this month. Even after the switch, this hasn't been
completely trouble free. Some git binaries for Mac or Windows seem
to be having difficulty, and this is a problem that continues to be
worked on.

Long Range Priorities:
======================

Automation
----------
This month has seen some automated testing enabled for our
configuration management repo. In addition, we've moved to RTC
from CTR for most changes - while this isn't final, the change
at this point appears positive.

Resilience
-----------
The big boost this month in resilience comes in the mail
infrastructure. See the comments in the mail section.

Technical Debt
---------------


Monitoring
------------
While most of our monitoring has been working we continue to
have issues on our legacy systems that leave us without an
ability to monitor. Additionally, some of our applications
need deeper introspection for specific functionality,
and we have yet to cope with that.

General Activity:
=================

Mail
-----
This month has seen the culmination of phase 1 of our mail
overhaul. Many thanks to the SpamAssassin community and
Kevin McGrail in particular for providing insight into
what they see as best practice. Phase one is focused on our
MXes, spam and virus processing. In some ways, we've overbuilt
the current deployment - our new architecture can scale horizontally
allowing us to handle many times our current mail load.

Much work continues to be done on this front, and should be seen
as phase 2 materializes.

Uptime Statistics:
==================

Overall, the total uptime for 2015 increased by 0.02% this month, in what has
generally been a quiet month in terms of emergency maintenance and downtime.
The few services that performed badly this month have been mentioned in
earlier reports, and steps are being taken to increase the availability of
these services in the long run.


Type:                Target:   Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:   99.50%             99.62%           99.83%      Yes/No
Core services:       99.00%             99.88%           99.95%      Yes
Standard services:   95.00%             98.96%           98.91%      Yes
----------------------------------------------------------------------------
Overall:             98.59%             99.57%           99.64%      Yes
----------------------------------------------------------------------------

Contractor Details:
===================

Gavin McDonald

- Oncall Duties: Whimsy died 04/24 and needed a reboot. Crius disk space reached
 75%.
- Some buildbot slaves offline, look into issues and bring back online.
- Look into various projects' Buildbot failures and resolve, liaising with the
 projects as necessary.
- Reviewing and merging others' branches to deployment.
- Crius and Hemera were 100+ packages behind, 3/4 of which were security
 updates.
- Updated both, then enabled auto-patching of security updates.
- More work on blogs_vm module, with others checking and merging
- Setting up a more robust local puppet module testing env
- Investigate and fix PagerDuty alerts for various services
- Investigate Moin Wiki ongoing issues and start to compile a report.
- Reduce Crius disk space to 70%
- Clear space on analysis-vm (sonar)
- 57 Jira tickets worked on, closing 40
- Various Cron mails looked into and resolved (mainly new cert errors)

22 Apr 2015

New Karma:
==========
None

Finances:
==========
$53 domain renewals
$2165 Cloud Services


Operations Action Items:
========================
n/a

Short Term Priorities:
=======================

LDAP
----
We've increased our spread of LDAP so that all of our cloud regions now
possess an LDAP server for authentication to work over the local network.

CMS
----
We've generated a FAQ to catalog issues and questions that came up during
the RFC. We hope to have some direction during or shortly after ApacheCon.

Long Range Priorities:
======================

Automation
-----------
We've been working on a number of automation efforts. One of our
recent deployments of LDAP has proven we can deploy a new host in
less than 10 minutes from provisioning to functional service.
Additionally we've been working on moving more services. One of
the highlights this month is the STeVe deployment. We now
have the service in a state where it is trivial for us to deploy
a fresh machine with a STeVe deployment for projects to use.

Resilience
----------
We've begun moving VMs off of our internal VMware deployment. At the
same time we have been working on spinning up our VMware-based cloud
deployment at PhoenixNAP.

Technical Debt
--------------
We suspect (but are unable to prove) that some of our VMware issues
are related to using EOL/EOS software from VMware that requires us
to run the deployment for at least one machine in a suboptimal
manner.

Monitoring
----------
While monitoring has been useful in identifying issues, we haven't
materially expanded its use this month, something we hope to remedy
in the coming month.

General Activity:
=================
Incidents
------------
We had a total of 98 incidents in the month of March that we alerted on
and paged the on call contractor out for. The relatively high number has
caused some concern, and we are tracking ways to reduce the number
of alerted incidents to the truly severe.
http://s.apache.org/WjC
http://s.apache.org/jQs

Jira
-----
A number of Jira imports have successfully been handled. These include the
plethora of Maven project imports, Tinkerpop, and Groovy among others. The
bulk of this work has been done by Mark Thomas who deserves special thanks
for tackling this work.

VMware hosts
-------------
One of our VMware hosts has been suffering from intermittent network
failures as well as the entire machine dying. This originally appeared
to be time related. We've begun migrating services off of this host in
order of priority, but the outages have affected a number of important
services, including the git services as well as blogs.

Uptime Statistics:
==================

Contractor Details:
===================

Gavin McDonald
--------------

Engage with OpenOffice community regarding a few Infrastructure related issues, such as the Mac Buildbots

Revisit and engage with CouchDB PMC to sort out 2 outstanding Domain Name
transfers.  Sort out couch.[org|com] domains with Dotster; the domains are
now successfully transferred.  Next step is to sort out secondary DNS and httpd
configs.

Covered on-call whilst folks were travelling to ACNA

Jira Tickets: 68 worked on and 56 completed as of 04/20/15

Confluence Wiki migration was completed.

Began work on moving the Blogs service to Puppet 3 and a new home in the cloud.
Began work on preparing to migrate TLP playgrounds to Puppet 3 and the cloud.

Prepared a draft policy covering VCS canonical locations.

On 4/17 qmail stopped sending mails and the queue went above 100000. Resolved
the issues, and the service is running normally again.

commonsrdf cms setup refuses to create a staging site. Looked extensively into
the problem but it remains unresolved at this time. This is a blocker for the
project as they have no website and are waiting on it to do a release. See:
INFRA-9260

18 Mar 2015

New Karma:
==========
N/A

Finances:
==========
$135.40 - Hetzner.de
$285.00 - Silicon Mechanics
$1267.11 - Amazon Web Services
$17.49 - Dotster

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
Codesigning
-----------
There have been several signing events in the past month.
4 total signing events for Tomcat 8.0.19 and 8.0.20.
1 total signing event for OpenMeetings 3.0.4

Machine deprecation
-------------------
We continue to make progress in moving services off of some
of our oldest hosts.

VMware issues
-------------
We experienced repeated network failures on one of our newer
VMware hosts. This took down most or all services on the host
repeatedly. This appears to be related to time issues, and seems
to have settled down, though we continue to watch it closely.

Backups
-------
The new backup service is moving along well. 7 of our hosts are
currently backed up by the new service. Our goal is to have all
of the hosts backed up by end of the year and deprecate our
Florida colo and the machine running there.

Long Range Priorities:
======================
Automation
----------
We've made good progress on a number of automation fronts.
The work to automate Confluence has been done, and the service will
move hosts in the coming weeks.
A number of our SSL Endpoints have now been puppetized, as have services
like asfbot, svngit2jira, and gitpubsub.

Resilience
----------
We haven't made a lot of progress on this front in the past month aside from
the ongoing automation efforts.

Technical Debt
--------------
We've run into a bit of technical debt surrounding CGI pages. This had a
deleterious effect on a number of projects. Several pieces of debt from
this were discovered. The first was that a contractor had been manually
setting the executable bit across hosts. The second was that the CMS was
not picking up and communicating permission changes even when set
appropriately. There had been a custom CGI module written in the past, and
we initially spent a large amount of time trying to get that to work in our
new environment, to much frustration. We ultimately decided that bearing
responsibility for maintaining a custom module would only add to the
technical debt we were amassing. In the end we set the executable bit for
a number of projects by using svnadmin karma. We believe these issues to
now be resolved.

Monitoring
----------
In the past month we've uncovered a few holes in our monitoring that have
resulted in users pointing out problems to us. We are still working to
fill those holes.

General Activity:
=================

CMS
-----
As mentioned in last month's report, we are struggling to find a solution
for the CMS, which is on an aging physical host. We've sent a request for
comments on some of our proposed solutions to PMCs and operations@:
http://s.apache.org/iL4

Uptime Statistics:
==================

Overall, the total uptime for 2015 fell by 0.05% this month, mostly
due to some issues with one of the VMware hosts, as well as
people.apache.org growing slightly more unstable (though nothing alarming).
We currently have two services with uptime below the SLA (people.apache.org
with 98.83% and the moin moin wiki, with 93.31%), while the remaining 26 in
the uptime sample are above the SLA. The bad performance of the wiki is due
to the fact that the moin moin wiki was not designed to scale well, thus
resulting in a lot of >30s response times - not errors per se, but still
counted against its uptime.


Type:                Target:   Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:   99.50%             99.65%           99.44%      Yes/No
Core services:       99.00%             99.83%           99.84%      Yes
Standard services:   95.00%             98.95%           98.82%      Yes
----------------------------------------------------------------------------
Overall:             98.59%             99.55%           99.47%      Yes
----------------------------------------------------------------------------
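
As a reading aid, the "Target Met" column can be reproduced with a short
Python sketch. The figures are copied from the table above; treating "Yes/No"
as "total meets the target / month does not" is an assumption about how the
column was filled in, and the function name is hypothetical.

```python
# Figures copied from the table above: (target, total to date, this month).
ROWS = {
    "Critical services": (99.50, 99.65, 99.44),
    "Core services": (99.00, 99.83, 99.84),
    "Standard services": (95.00, 98.95, 98.82),
    "Overall": (98.59, 99.55, 99.47),
}


def target_met(target, total, month):
    """Describe whether the total and monthly figures reach the target."""
    total_ok = total >= target
    month_ok = month >= target
    if total_ok == month_ok:
        return "Yes" if total_ok else "No"
    # Assumed convention: first word is the total, second is the month.
    return f"{'Yes' if total_ok else 'No'}/{'Yes' if month_ok else 'No'}"


for name, figures in ROWS.items():
    print(f"{name}: {target_met(*figures)}")
```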


Contractor Details:
===================


Daniel Gruno:

 Non-JIRA activities:
   - Moved (and puppetized) services from urd (baldr) to a new host:
       - gitpubsub
       - svngit2jira
       - asfbot
   - Weaved reporter.apache.org into Marvin
   - Helped debug and fix issues with Marvin not working
   - Fixed some issues with the URL shortener (bad regex)
   - Updated documentation for Git services (git-wip + mirrors)
   - Fixed API (3rd party change) for our status page
   - Fixed an issue with the project services monitoring not firing off
     alerts
   - Set up a replacement for aurora for www-sites.
   - Experimented with the CouchDB project on a new gitwcsub service for web
     sites, allowing projects to use git for their web sites instead of svn
   - On-call duties


Geoffrey Corey:
 - Resolved 31 JIRA tickets
 - Fix mail relaying from hosts not in OSU network (bugzilla host)
 - Work with AOO to get back missing buttons/options in their bugzilla
 - Work on backlog of git related JIRA tickets
 - Work with GitHub support to turn back on the mirroring service
 - Create mysql instances in PhoenixNAP for new VMs in there
 - Migrate pkgrepo to bintray (still need to figure out GPG signing in a
   proper way)
 - Work with Gavin on the Puppet 3 confluence wiki module and its
   dependencies
 - Create puppet manifest for erebus-ssl terminator/proxy rebuild and finish
   up testing for it
 - On-call related duties

Chris Lambertus:
 - time off due to personal/child care
 - On call duties
 - General assistance to other contractors
 - Troubleshooting and diagnostics for ongoing lucene git issues
 - troubleshooting and diagnostics for ongoing eirene VM host issues
 - completed base functionality of zmanda project
 - resolved issues with zmanda+vmware connectivity
 - resolved issues with abi reaching storage limits (add monitoring)
 - extensive tuning of zmanda system

Gavin McDonald
 - Worked on 30 tickets, closing 23
 - More Confluence wiki puppetisation
 - Migration work of Confluence to a new home (99% complete)
 - Various other general support, looking at VM issues

18 Feb 2015

New Karma:
==========
none

Finances:
==========
17.49  - Domain name renewals
979.12 - Amazon Web Services

As a side note, we expect to spend dramatically less on hardware than
we originally budgeted. This difference comes from a number of factors:
our build farm has been dramatically subsidized thanks to Yahoo, who
have provided ~30 physical machines (along with hosting the hardware
and providing smarthands support). We also now have multiple cloud
providers giving us extensive credits, and we are moving a number of
services to public cloud providers. This does mean a slight change
from capital expenses to operational expenses, though I don't think
it matters much from our perspective.

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

Codesigning
-----------
Tomcat generated two signing events this month, both for Tomcat 8.0.18

Machine deprecation
-------------------
We've made significant progress in moving services off of some of our oldest
hosts. In doing so we've also spent a good chunk of time automating these
services and making our deployment more robust. See more on this issue
in the section on Automation in Long Range Priorities.

Backups
-------
We are just now beginning the deployment of a new centralized backup service.
The client installation as well as the server has been automated; expect
to see more in this space in the coming months.

LDAP
----
The LDAP service has largely been rebuilt from the ground up. For
background, the old machine that formerly served as our svn master was
also one of our LDAP machines, and when it failed, we were down to one
very poorly performing instance in the US. Subsequently, we've rebuilt
the entire service using configuration management, and we again have two
LDAP hosts in OSUOSL that are easily handling the load. This has sped
up many of the authn/authz actions that were slow last month. We've also
deployed new LDAP instances to several of our cloud zones. While the
service seems to be working well at this point, we did have a few
hiccups, where the old LDAP servers were removing newly created
accounts. The older LDAP servers had been left in place after we
repeatedly found services that had either minotaur or harmonia
hardcoded as LDAP servers. Because of these problems we removed the
legacy instances, and are dealing with the problems caused as we find
them.

Bugzilla
--------
Our three Bugzilla instances were still running on a 6 year old machine
but have since been successfully puppetized and migrated to VMs in one of
our cloud accounts. In the process we've worked fastidiously on improving
the software deployment mechanism (software is now deployed as an
OS-native package).

Long Range Priorities:
======================

Automation
----------
Significant progress to report this month. As indicated above, LDAP
machines are all under configuration management. Additionally, we migrated
all three of the Bugzilla instances and the git repositories to being
completely managed by configuration management. After finding some
problems with some of our CM-managed instances, we've adopted a process of
destroying and recreating a service as a verification step prior to
pressing a service into production status.
Naturally, the new services we are bringing online, like the backup service,
are all managed under CM.

Technical Debt
--------------
The move of LDAP has uncovered a lot of hardcoded values and we've been
working to pay that off (referring to service names rather than
specific machine identities, and putting configuration in configuration
management where possible).
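
The idea of referring to a service name rather than a specific machine can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions: the registry, the function name, and all host names below are
hypothetical, and in practice such a mapping would live in configuration
management rather than in code.

```python
# Hypothetical sketch: clients look up a service name instead of a machine.
# All host and service names below are invented for illustration.
SERVICE_HOSTS = {
    "ldap": ["ldap1.example.org", "ldap2.example.org"],
}


def ldap_uris(service="ldap", registry=SERVICE_HOSTS):
    """Build LDAP URIs from the service registry rather than hardcoded hosts."""
    hosts = registry.get(service)
    if not hosts:
        raise LookupError(f"no hosts registered for service {service!r}")
    return [f"ldaps://{host}" for host in hosts]


print(ldap_uris())
```

Replacing or adding a host then means changing one registry entry instead of
hunting for a machine name in every client.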

General Activity:
=================

Bintray
-------
As of January 30th, we enabled Cassandra's debian repository on bintray
and have been monitoring it closely. The service seems to be working
well and appears to have fulfilled the needs of reducing our overall
webserver traffic and has the bonus of giving more insight into the
downloads. You can see some of the statistics on the dashboard here:
https://s.apache.org/bintray1
https://s.apache.org/bintray2

Maven
-----

A lot has been happening around Maven this period.
We've successfully been able to sync a copy of the Maven central
repository, and are working to provide access to that store for the Maven
PMC.
Additionally, Mark Thomas along with folks from the Maven PMC have been
working on migrating the Maven contents of the Codehaus Jira instance to
the ASF instance. Good progress has been made here, but much remains to be
done.


Uptime Statistics:
==================

We experienced an issue where status.apache.org was reporting erroneous uptime
statistics due to two unused LDAP checks, which unfortunately made it into the
weekly ASF blog posts. Other than that, services have been running fairly
smoothly aside from the Moin Moin Wiki which has experienced high load times
for a couple of weeks. We are discussing what to do to remedy this. Overall,
the total uptime for this year grew by 0.03%:


Type:                Target:   Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:   99.50%             99.78%           99.82%      Yes
Core services:       99.00%             99.83%           99.92%      Yes
Standard services:   95.00%             99.02%           98.93%      Yes
----------------------------------------------------------------------------
Overall:             98.59%             99.60%           99.63%      Yes
----------------------------------------------------------------------------



Contractor Details:
===================

Chris Lambertus

 - on call duties
 - closed 3 jira issues
 - extensive work on centralized backup deployment
   - ongoing testing and validation of zmanda evaluation
   - puppet work to build fully configuration management deployed host
 - resolved hardware issues with oceanus/FUB/Dell Germany
 - ubuntu libc vulnerability patching

Geoffrey Corey:
 - Resolved 25 JIRA tickets
 - Build out supporting environment in Puppet for bugzilla migration (sql
   database, webserver/proxy, bugzilla package building, etc)
 - Migrate and deploy bugzilla instances off baldr and into VMs
 - Investigate missing LDAP accounts with others (and subsequently recreate
   them) after LDAP server rebuilds
 - Fix dist.apache.org authorization template regeneration (related to svn
   master rebuild)
 - Work on getting postfix alias management in Puppet
 - Begin learning buildbot related things from Gavin

Daniel Gruno:
 - Resolved 30 JIRA tickets
 - On-call duties
 - Worked on GitWcSub, the git version of SvnWcSub for potentially enabling git
   repos to act as web site sources
 - Puppetized and tested deployment of GitWcSub
 - Fixed a bunch of issues with the main MTA
 - Worked with others to resolve the aftermaths of the LDAP network redesign
 - Fixed some issues with uptime reporting and alert statuses on
   status.apache.org
 - Fixed some issues with hardcoded values in the PMC management tools
 - Reached out to contacts about the DNS system overhaul

Tony Stevenson:
 - A lot of my time has been spent on 3 major tasks:
   - The rebuild of the LDAP service due to the poorly performing incumbent
     instances. Following the retirement of eris (old svn-master) after a
     terminal hardware fault, we were limited to 1 LDAP instance in the US,
     which proved too much for minotaur to cope with. It appears that
     the version of slapd on FreeBSD on minotaur leaked memory at a phenomenal
     rate.  The new LDAP service has been moved to the latest slapd available
     in Ubuntu 14.04; it has been fully built using puppet, and configured so
     that new LDAP hosts can be added with ease.
   - Continue preparation and understanding for a rebuild of the email
     infrastructure. This has mostly taken a back seat but has now been
     brought to the top of my todo list following the completion of the LDAP
     task above, and the CMS task below.
   - In line with our current policy of retiring hardware that is over 4 years
     old, and trying to make all services puppet managed, David asked me to
     review the CMS zone on baldr (which is now >7 years old) to ensure we can
     move the service and have it managed with puppet. However, upon
     investigation it quickly became apparent that moving the service was far
     from trivial given the requirement for ZFS alone.  Further reading of the
     code highlighted some areas of concern that I felt needed raising,
     as they would likely carry technical debt into the future, and that
     is something we are working extremely hard to remove.  On the back of
     these findings and my inherited knowledge of the service, I presented
     David with 3 options for how I thought we could manage the service going
     forward, along with the estimated costs and the pros and cons of each
     option. These were:
     1 - Move CMS to another FreeBSD host - undesirable given our current
         trajectory of moving away from FreeBSD; it also meant entirely
         replicating the FreeBSD 9 jail. This was the path of least change,
         but perhaps the most difficult.
     2 - Move the service to Ubuntu, fully puppet controlled from the
         beginning. This might have been the ideal scenario, but given the
         underlying hardcoded FreeBSD aspects, the need for ZFS (which is
         present in Ubuntu), and the very specific perl that is in place, I
         felt this would take a long time to complete and would be
         significantly error prone. We would very likely miss something and
         this would need to be fixed on demand. My confidence level was low
         that we could execute a clean migration. This would be the most
         time consuming option, but the best option if retention of the CMS
         was important.
     3 - Deprecate the CMS, and allow projects to determine their own
         publishing/transformation options. We would still require use of
         pubsub technology, but projects can commit their HTML into that,
         either directly edited or derived from markdown which they can keep
         in their project repo (some of the finer details will need to be
         worked out later). This would essentially be the cheapest option and
         remove technical debt, and while the timeline would be many months,
         contractor/volunteer time could be kept to a minimum.

21 Jan 2015

New Karma:
==========


Finances:
==========
1429.14 - Amazon Web Services
3969.00 - Carbonite/Zmanda

As a side note, the new cards have arrived, which provide a much
better level of insight into spending; many thanks to the office
of Treasurer and EA for chasing this.

Operations Action Items:
========================

N/A

Short Term Priorities:
======================

Codesigning
-----------
Another project has requested code signing functionality. (UIMA
INFRA-9002).
Four signing events occurred in the month. Two events each for Tomcat
8.0.16 and 8.0.17.

Machine deprecation
-------------------
Work continues (and was slightly hindered by the holidays) on
deprecating the host that runs the writable git service and bugzilla.

Backups
-------
Over the past several months we've found a number of services where
backups were either failing or not happening at all. We've spent a
good deal of time focusing on auditing backups and looking for a new
solution that gives us better visibility into the success or failure
of backup jobs. To that end we've selected Zmanda as our platform
of choice and have begun deploying it.

LDAP
----

LDAP has emerged as a priority during this month. The loss of the
machine that served as the svn master last month reduced the number of
LDAP servers in our Oregon Colo to 1, and that instance is
consistently under heavy load, and logins to most services are taking
significantly longer as a result. We've had a lot of work in process,
and only recently began tackling the issue.

Long Range Priorities:
======================

Monitoring
----------

Work is continuing on monitoring, with a good leap forward this month.
We are taking advantage of (and contributing to) a project by the name
of dsnmp that queries the Dell OpenManage SNMP/WBEM frameworks as well
as the overall operating system health of each machine. This provides
status checking to alert us to issues that have frequently resulted in
outages or service degradation. This month we were alerted to multiple
issues that we were able to address before they resulted in outages.
This is not a panacea, nor are we done with monitoring efforts, but we
are in a much better position now.

Automation
----------
Automation progress continued, though slowed somewhat by the holidays.
The writable git service is now handled by configuration management.

Currently the following services are in progress or in final stages of
testing:
blogs.a.o
Bugzilla

Resilience
----------

We haven't made much progress on this front in the past month, aside
from the ongoing automation efforts.

Technical Debt
--------------

As part of our automation efforts, we've been able to generate
recreatable Debian packages for our customized version of Bugzilla.
The long term plan is for us to have a private build job that builds
new packages anytime the source code in our tree for Bugzilla is
modified.

General Activity:
=================

We continue to explore the package repository service and have folks
from Cassandra currently working on moving their existing deb
repository from www.a.o/dist to this service as a pilot. Traffic about
the pilot has generated interest from other projects.

Addressing a long standing todo that came out of the Heartbleed
vulnerability, we now have an enterprise account with Symantec
that will allow us to provision certs on demand with no interaction
required for apache.org and openoffice.org.

We've migrated a traffic intensive service from one provider to
another to minimize our cloud hosting expenses. (Currently 3/4 of our
cloud infrastructure expense is from egress traffic). The new provider
has a different fee structure that should result in noticeably lower
service charges.

Maven has requested, and we've agreed to host, an ASF copy of the
Maven Central repository from Sonatype. Work is starting around that,
but is still early. This is expected to cost in the range of $600 per
annum.

We began discussing deprecating translate.apache.org as the service is
provided for only 4 projects, and has only one volunteer doing the work
of administering the service. Additionally, a number of l10n services
advertise free l10n hosting for OSS projects. Many of our projects are
already making use of those free offerings.

Uptime Statistics:
==================
We have revamped our uptime charts a bit, added some services and removed some
deprecated ones. Our current overall target is 98.59% for these samples. This
month, the overall uptime was 99.61% with critical services achieving 99.81%.
All of the downtime here was due to moving the writeable git repos to a new
machine.


Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:      99.50%    99.81%        Yes
Core services:          99.00%    99.66%        Yes
Standard services:      95.00%    99.18%        Yes
---------------------------------------------------------
Overall:                98.59%    99.57%        Yes
---------------------------------------------------------



Contractor Details:
===================

Geoffrey Corey:
 - Resolved 26 Jira Tickets
 - Clean up some TLP server related puppet modules to require no input
    and make sure deployment is a 1 step process (also allows svnwcsub
    use for services such as status.apache.org)
 - Add logic to puppet that deploys Dell OMSA to physical Dell hosts for monitoring
 - Fix svnpubsub not updating www.apache.org/dist entries (related to svn master rebuild)
 - Coordinate with OSUOSL to replace disk in Arcas
 - Research fpm to build an ASF bugzilla Debian package whenever the source tree changes
 - Create bugzilla puppet module to deploy ASF's different bugzilla instances
 - Complete TLP graduation for Falcon and Flink
 - Various on-call duties

Chris Lambertus:

 - Resolved 2 Jira Tickets
 - On call (xmas)
 - Resolved a number of hardware issues with erebus (bad dimm)
 - Installed and coordinated restarts for OMSA on Eirene
 - Cleanup of collectd configuration in Puppet to apply collectd to any system
 - Deployed new status.a.o at RAX
 - Installation, configuration and evaluation of Zmanda and other backup
   tools
 - Initial documentation of zmanda license count and tally of existing
   storage
 - Oceanus troubleshooting and coordination with Dell/Dell Germany to get
   warranty location updated and parts shipped to the right place (FUB)
 - Initial ESXi configuration to enable WBEM monitoring (eirene)

Gavin McDonald:

 - Worked on 53 tickets closing 34
 - Work on puppetising TLP VMs and Blogs

Daniel Gruno:

 - Assisted Tony in moving the writeable git repos
 - Orchestrated the move of projects.apache.org from Infra to ComDev
 - Improved monitoring of ZFS pools
 - On-call duties
 - Ongoing discussions on svn redundancy setup, corporate offers and DNS setup

17 Dec 2014

New Karma:
==========
Andrew Bayer was added to root@

Finances:
==========
$758.48 Windows License
$761.27 Replacement Harddrives
$1607.78 Service contracts
$1247.95 AWS

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

Codesigning:
------------
A number of projects inquired about codesigning at ApacheCon with
intent to sign up and request access to codesigning.
In the last 30 days no releases have been signed.

Machine deprecation:
--------------------
The failure of the machine underlying the SVN master has caused some
reprioritization of work and evaluating our older physical machines
for the important services that run there. There is ongoing work to
relocate the writable git service as well as the Bugzilla machines off
of the rapidly aging hardware.

Long Range Priorities:
======================

Monitoring
----------
Work continues on monitoring, with several advances being made.
The first is work with collectd, which is being deployed to all of
our puppetized machines. This gives us insight into performance metrics
of each machine. Additionally, we've managed to get Dell's OMSA platform
for monitoring the underlying hardware deployed to a number of physical
hosts. This information is being utilized to monitor for things like
failed disks, failed power supplies, and see other overall health
information like ambient and internal temperatures.

We've gotten a platform that integrates PagerDuty, Hipchat, and email
alerts for our Dell physical hardware that has OMSA installed.
You can see the code here:
https://github.com/humbedooh/dsnmp

We are also expanding the radius of hardware notifications from root@
to infrastructure-private. At the same time, we are reducing those
notifications to a single email per day.
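
One way to picture collapsing many notifications into a single daily email is
the Python sketch below. It is not the code used by the monitoring tools
mentioned above; the function name and the sample alerts are invented for
illustration.

```python
from collections import defaultdict
from datetime import datetime


def daily_digests(alerts):
    """Group (timestamp, message) alerts into one digest body per calendar day."""
    by_day = defaultdict(list)
    for timestamp, message in alerts:
        by_day[timestamp.date()].append(message)
    return {day: "\n".join(messages) for day, messages in sorted(by_day.items())}


# Invented sample alerts, for illustration only.
sample = [
    (datetime(2014, 12, 1, 3, 15), "hostA: disk 2 predicted failure"),
    (datetime(2014, 12, 1, 17, 40), "hostB: power supply 1 lost"),
    (datetime(2014, 12, 2, 9, 5), "hostA: ambient temperature high"),
]
for day, body in daily_digests(sample).items():
    print(day, "->", len(body.splitlines()), "alert(s)")
```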

Automation
----------
We continue to make large steps forward in our automation efforts.
The following services are now in configuration management. Several others
are currently in progress.
rsync.apache.org
Git mirrors
Subversion master
status.a.o website

Additionally, some of last month's work on
getting the webserver into configuration management has been
adopted upstream.
https://github.com/puppetlabs/puppetlabs-apache/pull/939

Resilience
----------
We now have a second critical service that we have the ability to
easily replicate and redeploy.

Technical Debt
--------------
We are encountering lots of touchpoints that are tied to specific
machine names, specific file locations on specific machines and
a large chunk of our time is spent in having to decouple this.

General Activity:
=================
This month has been extraordinarily taxing on Infrastructure, with
three large outages or service degradations occurring following
ApacheCon.

Several volunteers and contractors traveled to Budapest to attend
ApacheConEU. In addition to a dedicated track around Infrastructure,
an Infra hackfest table was manned where folks could come and ask
questions or get help. Many folks took advantage of this.

While at ApacheConEU, a number of volunteers and contractors met with
folks from OpenOffice to talk about existing and future needs.
Details of this meeting can be seen in Andrea Pescetti's notes:
http://s.apache.org/aooinframeetingreport

Git mirroring (including git mirroring to github) suffered some disruption
immediately following ApacheConEU. The 6+ year old machine that the
service was running on was shared with 3 project zones, and as the number
of git mirrors has increased over time, service delays began increasing
to the point that the service was non-functional. Initial attempts to
restore service on the machine were not successful over the long term and we
ended up moving the service off of the affected host and into AWS. The
underlying machine that the service was running on has been deprecated. We
have sent the projects with zones there notice that we plan to shut down
the machine in the not too distant future. We have begun organizing
replacement resources in lieu of the Solaris zones.

The machine that ran the Foundation's SVN master suffered an outage caused
by a failure of the root filesystem array. The initial public reporting
is at:
https://blogs.apache.org/infra/entry/subversion_master_undergoing_emergency_maintenance
The post-mortem report from this event is at:
https://blogs.apache.org/infra/entry/svn_service_outage_postmortem
We were able to resurrect the service on new hardware with the
configuration residing completely in configuration management. This should
minimize the time to recovery for future issues.
The recovery took approximately 2 days to complete.

We suffered a loss of all network connectivity to services in our colo facility
at Oregon State University Open Source Lab on 10 December. The outage lasted
almost 2 hours. We are still working with OSUOSL, OSU, and NERO to figure out
what happened. A redundant (but disabled) network link was activated to
bring us back online.

While monitoring the newly provisioned webserver, we discovered that
Cassandra is pointing users to a .deb package repository on the main
webservers instead of utilizing the mirrors, as package repositories
won't function with our current mirror offering. After some analysis
we found that this package repository was the source of 15% of all
traffic hitting our webservers.  Our initial thought was to block
that traffic. Doing so would have had a large impact on folks. We
are currently researching options to provide package repositories
so as to remove that load from our main webservers.

Uptime Statistics:
==================
Unfortunately, uptime for critical services this month saw a sharp decline due
to the subversion outage. Also affecting uptime was the brief network outage on
December 10/11, as well as the migration of the git mirrors to a new location.
Overall, thanks to exemplary behavior from all other services, we did
experience a slight increase in overall uptime of 0.03% compared to previous
months. Thus, the total recorded uptime for 2014 (weeks 27 through 50) is as
follows:

Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:      99.50%    99.84%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    98.38%        Yes
---------------------------------------------------------
Overall:                98.00%    99.39%        Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime


Contractor Details:
===================

Geoffrey Corey:
  - Resolved 37 JIRA tickets
  - Worked on completing the steps to graduate 4 projects to TLPs
  - First round of On-Call duties
  - Rename argus podling to ranger
  - Coordinate with OSUOSL to replace disk in Tethys
  - Coordinate with Henk P. to migrate rsync.apache.org us host off eos and
    onto the AWS TLP server
  - clean TLP off eos to help recover disk space
  - Clean up lingering details with TLP/dist and rsync being migrated off eos
  - Helped in various ways to restore svn master after the failure of eris

Daniel Gruno:
  - Helped to restore the subversion master after the failure had occurred
  - Worked on implementing more extensive SNMP monitoring of capable (mostly
    puppetized) hosts
  - Moved git.apache.org off the aging Solaris box and onto a new puppetized VM
  - Tweaked svn/git-to-github syncing process to cut down execution time from
    2 hours to 8 minutes
  - Helped clean up some TLP snafus in relation to bringing svn templates to git.
  - Various on-call duties
  - Resolved 35 JIRA tickets since last report
  - Fixed various rendering issues with status.apache.org by using asynchronous
    calls

Chris Lambertus:
  - Resolved 9 Jira tickets
  - Helped to restore the subversion master after the failure had occurred
  - On-call duties
  - fixed FUB VPN and rebuilt oceanus as cloudstack testbed
  - resolved ongoing issues with secmail.py
  - resolved major outage due to eirene vmware network failure
  - resolved nightly backup problems with several hosts
  - ongoing prototyping and evaluation of "enterprise" backup solutions
  - troubleshooting assistance with TLP/git/svn puppet migrations and service
    configs
  - documented nexus project creation process
  - ongoing addition of hardware to Dell OME, acquisition of windows license
    to make this a complete service

Tony Stevenson:
  - Worked on the restoration of SVN master, including moving to Ubuntu and
    puppet
  - Started work on moving the git-wip-us service off of the aging baldr onto
    an Ubuntu VM.
  - Various on call activities
  - Attended Apachecon, where we had several infra sessions and an Infra
    meetup on the Sunday following the conf.
  - Worked on several JIRA issues and some longer term background activities
    like host patching.
  - Conducted the SVN master post-mortem exercise
  - Writing and updates of puppet modules to continue to make them platform
    agnostic

Gavin McDonald:

 - Worked on 75 Jira Tickets, closing 57.
 - Infra commits SVN - 33
 - On Call duties.
 - Work with new contractors on various issues.
 - Updating non puppet VMs/Machines for security updates.
 - More work on reducing cron mails
 - Work started on archive logging for HipChat
 - Restored SVN to Buildbot Hooks mechanism after the move.

19 Nov 2014

New Karma:
==========

Joe Schaefer has resigned from Infrastructure Committee (and root@)


Finances:
==========

$250 for AWS


Operations Action Items:
========================
none

Short Term Priorities:
======================

* Wiki outages - As highlighted in the October Infra report we have been
  running into issues with the MoinMoin wiki. This degradation is caused
  by severe disk IO load on the machine that hosts this service. This is
  complicated by the fact that this same machine also hosts the US web
  mirror for the foundation and projects as well as the mail-archives
  service. Additional fallout has been that publishing website updates
  has tremendous delays for websites on the US mirror. We think that much
  of the sudden IO load increase is due to the machine's ZFS filesystem
  growing to ~90% full. Because of the copy-on-write nature of ZFS, and the
  allocation switches that happen when a volume begins approaching
  capacity, performance severely degrades. We evaluated all of the
  services on the host, and our initial analysis was that we'd be best
  served by separating the web sites for projects and the foundation.
  During the course of doing that, we've discovered that a large number
  of projects were distributing artifacts and publishing their website
  using a long-deprecated (February 2012 deprecation) method. This, and
  other factors, have complicated the process, but today I am happy to
  report that we now have an easily replicable webserver definition in
  configuration management that allows us to deploy any number of
  webserver hosts in short order. We also paid off a large amount of
  technical debt in the process, and the newer members of our team now
  understand the entire website process well, from checkin to
  publication. This has dramatically improved the wiki situation, though
  it has revealed some underlying issues with the wiki that will need to be
  dealt with.


Long Range Priorities:
======================

Monitoring
----------
While we continue to work on monitoring, and have far to go, this
month yielded two major improvements: the first in disk capacity alerting, and
the other in failed array management. This is not yet pervasive in our
infrastructure, but is a start towards that end.
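
A disk capacity check of the kind described can be sketched with the Python
standard library. This is a minimal illustration, not the alerting that was
actually deployed: the 80% threshold and the function names are assumptions,
chosen only so that a warning would fire before a filesystem reaches the ~90%
level noted in the wiki item above.

```python
import shutil

# Hypothetical threshold; the report only notes trouble as ZFS neared ~90%.
CAPACITY_ALERT_PERCENT = 80.0


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def needs_alert(path, threshold=CAPACITY_ALERT_PERCENT):
    """Return True when the filesystem holding `path` is at or above threshold."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold


# Check the filesystem that holds the current directory.
print("alert" if needs_alert(".") else "ok")
```

A real deployment would feed the result into whatever notification channel is
in use rather than printing it.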

Automation
----------
This month resulted in a large step forward, as a number of services are
now able to be deployed in an automated fashion and are in configuration
management. Many of these have been in process for a month or longer.
They include the following services:
* committers mail-relay
* all project and Foundation websites
* rsyncd.apache.org (mirror distribution)
* host provisioning dashboard
* inbound email MXes

Resilience
----------
The work done around automation has given us our first critical
services that we can easily replicate and deploy multiple times. To give you
an idea of scale, we can deploy an external email exchanger, completely
configured, in less than 10 minutes, or all of the project websites in about
3 hours (largely bound by having to download all of the site content).

Technical Debt
--------------

This month saw us paying back large portions of technical debt. Of particular
interest is the shuttering of legacy means of publishing releases and websites
that were deprecated almost 3 years ago. We also were able to decouple a large
number of very tightly bound services.

General Activity:
=================

It is worth noting that part of the new monitoring systems we have in place
keeps an eye on the status of internal disks and disk arrays. On the 30th of
October we were notified in our HipChat room that the machine that
serves as our svn master had a bad disk in its array. One contractor
went to the datacenter to replace the disk with a spare from our inventory
whilst another contractor configured and onlined the disk, re-adding it to the
pool. The hardware issue went from notification to replacement and back
online all in the same day.

Code Signing
------------
Two more releases were signed this month, both from Tomcat.

Three additional projects (Logging (Chainsaw), OpenOffice, and OpenMeetings)
are now set up and enabled to sign artifacts, though most are still testing.

Uptime Statistics:
==================
Overall, uptime has seen an increase of 0.30% compared to
last month, putting uptime for the October-November period at a record high
99.82% overall. The total recorded uptime stats since we started measuring it
are as follows (weeks 27 through 46):

Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:      99.50%    99.97%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    98.14%        Yes
---------------------------------------------------------
Overall:                98.00%    99.36%        Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime
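
To make the percentages above more concrete, the sketch below converts an
uptime percentage into hours of downtime over the measured period. It is an
illustration only: it assumes uptime is measured against wall-clock time for
every hour of weeks 27 through 46, which the report does not state, and the
function name is made up for this example.

```python
HOURS_PER_WEEK = 7 * 24

# Weeks 27 through 46 inclusive, as stated in the report.
WEEKS_MEASURED = 46 - 27 + 1


def allowed_downtime_hours(uptime_percent, weeks):
    """Hours of downtime implied by an uptime percentage over `weeks` weeks."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * weeks * HOURS_PER_WEEK


if __name__ == "__main__":
    # (service tier, target %, measured %) copied from the table above.
    rows = [
        ("Critical", 99.50, 99.97),
        ("Core", 99.00, 99.77),
        ("Standard", 95.00, 98.14),
    ]
    for name, target, reality in rows:
        budget = allowed_downtime_hours(target, WEEKS_MEASURED)
        actual = allowed_downtime_hours(reality, WEEKS_MEASURED)
        print(f"{name}: budget {budget:.1f}h, implied downtime {actual:.1f}h")
```

Under these assumptions, the 99.50% critical target corresponds to roughly
16.8 hours of permitted downtime over the 20 weeks, while the measured 99.97%
corresponds to about one hour.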


Contractor Details:
===================

Daniel Gruno:
    Non-JIRA related issues worked on:
    - Explored and implemented a rewrite of our DNS system
    - Miscellaneous help/guidance for new staffers
    - Collated www+tlp server stats for an overview of our traffic/request
      rates
    - Worked on setting up Chaos as a disk array for Phanes (for unified
      logging)
    - On-call duties
    - Assisted Geoff in moving rsync'ed data to svn for web sites
    - Helped tweak httpd instance on tlp-us-east to cope with the request load
    - Tweaked status.apache.org, added hard-coded notice about wiki.a.o
    - Fixed dependency issues with Whimsy
    - Deprecated SSLv3 on all SSL terminators in response to POODLE

Geoffrey Corey:
    Non-JIRA related issues worked on:
    - Finished migrating all tlp sites into puppet and new tlp host
    - Migrated lingering projects using rsync for artifacts distribution to
      using svnpubsub for distribution
    - Clean up retired sites with correct redirects for www.a.o/dist to their
      attic pages
    - Decommission/surplus the old hermes hardware
    - Replace disk in erris

    JIRA related tasks:
    - Resolved 11 JIRA tickets
    - Renamed incubator project optiq to calcite

Gavin McDonald:

 - Worked on 52 Jira Tickets, closing 27.
 - Infra commits SVN - 40
 - On Call duties.
 - Work with new contractors on various issues.
 - Resolve queries from IRC, HipChat and Email (no jira tickets)
 - Updating more Ubuntu machines/vms for Bash vuln.
 - Worked more on upgrading pkgng FreeBSD machines.
 - Continue Work on improving Pass rate of builds, liaising with projects as
   necessary.
 - Configure new disk into Eris Array.
 - Adding packages to all Jenkins slaves via ansible is now working fine. Work
   started on doing the same for Buildbot slaves.

Chris Lambertus:

  - Closed 9 jira tickets
  - First on-call
  - Created new VM for status.a.o migration to Cloud (RAX)
  - Began work on evaluating backup and disaster recovery processes
  - Noted & resolved problems with zfs on abi causing failed backups
     - upgraded abi to FreeBSD 10.0
     - purged extraneous zfs snapshots
  - analysis and evaluation of tools for improved backups
  - implemented collectd puppet module (monitoring)
  - added circonus monitors for new tlp (monitoring)
  - begin work on oceanus cloudstack eval
  - secmail.py troubleshooting & repair
  - MX incubator list troubleshooting with Tony
  - metis disk replacement PERC troubleshooting with Tony

Tony Stevenson
  - Working on several major priorities:
    - eos - The main US webserver has slowly, over time, grown its disk
      capacity usage levels, most recently growing over the threshold at
      which ZFS suffers from significant performance penalties. Disk capacity
      can't easily be increased and eos is scheduled for EOL, so the short
      term goal was to tidy up the data on disk. This was achieved by moving
      some of the older static data to eris. See work by others on the
      overall status of retiring eos. Also see below commentary on a wiki
      migration PoC.
    - abi - The host in Traci.net (FL) that we have been using for an offsite
      copy of data for a number of years had suddenly become increasingly
      unusable and jobs were failing. This was primarily caused by a failure
      in removing old data snapshots. This essentially stemmed from the period
      when hermes had to be rebuilt. A lot of triaging of old copies of data
      had to be done; this was done in conjunction with others, notably cml@.
    - hermes - Fixed a long standing issue with a faulty disk on the dungeon
      master that hosts hermes. Also worked on better managing the incoming
      mail queue as on occasion it backlogs and has a compound effect on
      genuine mail delivery.
    - chaos (host where ELK is to be part-deployed) - this work was delayed
      until AC EU given the more urgent issues on eos and abi.
    - MX - After the unsuccessful attempt to migrate the MXes to new hosts run
      from AWS EC2, several lessons have been learnt and we have fixed all but
      one of these at the time of writing this report. The last fix is more
      complicated and needs significant testing to sign off, and then we need
      to expose this change to the mailing lists affected so that they are
      kept in the loop, though we are aiming for a completely transparent
      cutover when we come to implement it.
  - As part of the bigger piece of work to unpick all the services on
    (and dependent upon) eos I have started work on setting up a PoC that will
    host the moinmoin wiki service (wiki.apache.org) in AWS EC2. This is
    making good progress and a data synchronisation should begin during AC EU,
    allowing the team to see it working during the F2F on the Saturday after
    AC EU.
  - More puppet work creating new modules and adding 3rd party modules,
    further extending the puppet managed aspects of our machines.
  - secretary@ workbench issues; this was seemingly related to a corrupt mbox
    file on minotaur. Moving this aside and having clr@ manually process the
    period allowed us to re-enable the automatic service, albeit at a much
    slower frequency than before.

15 Oct 2014

New Karma:
==========
none

Finances:
==========
Domains: $85


Operations Action Items:
========================
None

Short Term Priorities:
======================

Code Signing
------------
The code signing service is now live. Two projects have successfully shipped
signed artifacts. Two more projects are in various stages of signing up, being vetted
or testing out the service.

Long Range Priorities:
======================

Monitoring
----------
Work has been ongoing to provide monitoring of the underlying hardware many of our
services depend on; this is still in the early exploratory stages and remains ongoing.
Work building on our existing base of service status monitoring to provide insight
is also ongoing.

Automation
----------
Much progress has been made around automation; multiple services are now fully defined
in configuration management, though we still have far to go.

Resilience
----------
Exploratory work to evaluate our ability to recover from disaster is underway, but is
still early.

Technical Debt
--------------


General Activity:
=================

The primary US web server that provides foundation and project websites in addition
to mail-search and the moin-moin wiki has been problematic this week. Specifically
we're running into multiple problems occurring at once resulting in tremendous
IO overhead, and leading to very slow responses or outright failures.

This has been a very busy month from a security perspective. Our Bugzilla instances
went through multiple rounds of patches in response to 5 related security issues.
Additionally, we've spent much time responding to Shellshock.

Uptime Statistics:
==================


Detailed Contractor Reporting
=============================

* Geoffrey Corey

  - Get acquainted with ASF's system and services layout
  - Resolved 7 JIRA tickets
  - Clean out ASF's server racks and old hardware/spare parts at OSU
  - Inventory spare hw at OSU
  - Learn how to do TLP requests
  - Learn how to do svn to git migrations
  - Setup AOO's mac mini build slave
  - Learn ASF's puppet infrastructure
  - Deploy NTP puppet module
  - Learn how svnpubsub and svnwcsub are setup and used, create a puppet module for it
  - Learn how to use AWS to begin migration of TLPs off eos to fix IO issues

* Gavin McDonald

 - Worked on 44 Jira Tickets, closing 30.
 - Infra commits SVN - 59
 - On Call duties.
 - Work with coreyg liaising with hardware to be removed.
 - Installed 2 new machines at OSU, DRACs configured, ready for OS installs
 - Resolve queries from IRC, HipChat and Email (no jira tickets)
 - Work through more Cron noise.
 - Worked with Intervision and organised Warranty renewals/declines.
 - Updating many Ubuntu machines/vms for Bash vuln.
 - Worked on and continuing to work on upgrading the FreeBSD machines. There is much
     work involved as we are breaking free of Tinderbox based updates and going
     direct to the official repositories. At the same time packages/ports are being
     updated (forcibly) to the new pkg system. The machines done so far are showing
     no ill signs as of yet but we haven't forced an upgrade of all packages, just a
     few essential ones. I expect as more machines are done, we'll start to see
     some things break. This is an unavoidable one way trip that we'll deal with.
 - Started looking into automating jira project key renaming, in support of dealing
   with when projects rename themselves. The Jira cli plugin looks promising, but
   doesn't seem to support renaming yet (but offers project cloning and deleting).
   Investigating the API directly but that too seems to lack support thus far.
 - Worked on improving Buildbot slaves stability.
 - Worked on improving Pass rate of builds, liaising with projects as necessary.
 - Restored RAT reports for projects and the RAT master summary pages.
 - In our cwiki, added the ability for projects to access Intermediate HTML for
   diagnosing formatting issues in PDF exports. (https://cwiki.apache.org/intermediates/)
 - Github to Buildbot to HipChat integration, testing github commits to infra repo

* Chris Lambertus
Tickets:
  * Closed N (I don't know what N is) tickets.

Outages:
   * Troubleshooting and resolution for the vmware host outage.
   * ongoing, www.a.o/mail-search/moin-moin troubleshooting

Puppet/Automation
   * Learning puppet
   * Implemented dovecot and SNMP modules

Monitoring:
   * Researched ways to monitor existing hardware that runs FreeBSD.
   * Began needs analysis and some PoC work around monitoring with
Circonus, collectd, SNMP, etc.

Disaster Recovery
   * Initial research into current state and how we might improve our
disaster readiness.

17 Sep 2014

New Karma:
==========
Chris Lambertus (cml)
Geoffrey Corey (coreyg)


Finances:
==========
RAM for VMware hosts  $715
Replacement HDDs  ~$1700
Mac OSX Build Slave $730
Puppet training $1300
Domains: $17


Operations Action Items:
========================


Short Term Priorities:
======================

* Code signing
  Mark Thomas successfully concluded his testing and we were able to come to
  agreement with Symantec. The service has thus far been deployed with Mark
  Thomas leading efforts to deliver signed code for Apache Commons. The Apache
  Commons PMC is currently voting on release artifacts for the first signed
  binaries. Post-completion of this test the service will be available to any
  PMC requesting the service.

* Build/CI environment: http://s.apache.org/hDu
  ** Yahoo has graciously increased the number of machines that they provide
     (and provide colo services for) to a total of 20 machines this year. This
     has tremendously reduced the pending queue size for our build services.

  ** Cloud slaves: Our RAX cloud environment is now being utilized by Jenkins
     to deploy (and destroy) machines on demand in response to load.
     Additionally, we’ve made a RAX account available to the Gora PMC for
     twice yearly testing they plan to engage in.

Long Range Priorities:
======================

* Monitoring
  We are beginning to get insightful information out of monitoring. We now
  have a mail loop that provides information on the cycle time from sending
  to mail reception. Additionally we now have started monitoring some
  elements of host storage. Centralized logging is making slow progress but
  has a plan with a time table.

* Automation
  The base level framework for machine automation is complete; and that work
  is expanding. As we begin to need to break services out we are building
  them with puppet. Additionally work to programmatically have JEOS machines
  for bare metal as well as virtualization and cloud targets is progressing
  nicely with most of that work expected to be wrapped up by end of month.

* Technical Debt/Resiliency
  Some work has happened identifying long complaining error conditions in a
  number of processes and resolving them; currently focused on errors around
  backup scripts.

General Activity:
=================

* Welcomed two new contractors, Chris Lambertus and Geoff Corey, to the fold.
* We’ve dealt with an unusually high number of hardware failures this
month.
* Sourceforge has reached out to infra regarding migrating Apache Extras to
Sourceforge.
* The machine that houses our US web server (for www.a.o and $tlp.a.o) as well
as mail-search and the moin-moin wiki has experienced tremendous IO load. Work
is ongoing to break out those services and reduce total IO load for any given
machine. This has been noticeable to end users in the form of wiki slowness
and updates to project websites being slow on the US website.
* repository.apache.org suffered a severe service degradation that resulted in
many projects being unable to publish artifacts to Nexus for several days. For
details see: http://s.apache.org/H2f
* We’ve found a number of processes that infra executes that appear to be tied
to being listed as a member in LDAP. We’re working to resolve that issue and
tracking it in INFRA-8336

Uptime Statistics:
==================

Targets remain the same as in the last report (99.50% for critical, 99.00%
for core and 95% for standard respectively).
These figures span the previous reporting cycles as well as the
present reporting cycle (weeks 34-38). Overall, the figures have gone up
since the last report, and we are continuing to meet the uptime targets.

Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:      99.50%    99.97%        Yes
Core services:          99.00%    99.77%        Yes
Standard services:      95.00%    97.46%        Yes
---------------------------------------------------------
Overall:                98.00%    99.16%        Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime


Detailed Contractor Reporting
=============================

* Daniel Gruno:
 Work done since past report:

 - Cleared 30 JIRA tickets. See those for additional details.
 - Helped introduce Chris to his new job. This included setting up his account,
   putting it into the correct staff groups, assigning some easy JIRA tickets
   to get started with and walking him through the process of resolving these
   tickets.
 - Fixed some mailing lists mistakenly marked as private. This seems to be a
   recurring problem, so we will need to tighten our mlreq page and make it
   harder to create a private list.
 - Created new mailing lists for Reef.
 - On-call duties.
 - Started work on the Infrastructure presentation for ApacheCon EU.
 - Discussed doing a "Git at the ASF" talk with David at ApacheCon, as we have
   a free slot.
 - Dealt with Freenode's security breach (mainly rerouting some IRC services to
   the EU and resetting passwords).
 - Started work on resolving the current issues faced by non-member staffers.
   This will likely take some time to finish, and involve several people. Our
   first priority should be getting a new ACL set up for browsing the mail
   archives, so root has access to this data. This is a sensitive operation, but
   one that should be well covered by the confidentiality clause in staffers'
   contracts.
 - ELK stack is progressing, storage setup expected to be done this week, at
   which point we will be able to start pointing some of the heavier services
   to it.
 - Answered queries from Joe Brockmeier re Hadoop moving to Git and the new
   status page.
 - Helped EVP and fundraising with the new Bitcoin donation methods (and
   answered queries on that).
 - Added commit comment integration with GitHub. This is still a work in
   progress, and I plan to rewrite the entire integration system when time
   permits.
 - Moved some VMs around in response to prolonged downtime on Erebus due to
   disk replacements. This resulted in minimal downtime for services (a few
   seconds at most).
 - Finished work on the subscription service for our monitoring of local
   project VMs.

* Gavin McDonald:
 Work done since last report:

 - 68 Jira tickets closed.
 - 7 Jira tickets closed were Hardware related repairs - Disks, PSU and Memory
   The hardware situation is much better. Still a couple to resolve.
 - More Jenkins work done, Ansible issues determined but the slaves are
   unreliable at present. Still have no OOB access to them and 9 times out of
   10 if a reboot is needed the slave doesn't come back. This situation is
   only tolerable for a certain period and that is nearly up. David is in
   talks to get more slaves available.
 - Buildbot has been worked on some more, it got left behind due to other work
   but is now getting some love once more. There is one major nag in that some
   slaves (and seem to be only the new ones) are failing randomly with xml
   corruption failures even though the checkout performs fine. Testing shows
   that the xml isn't being returned (but only some of the time).
 - There are plans to upgrade Buildbot Master this month.
 - There are plans to upgrade Confluence (across several versions) this month.
 - On Call duties
 - Working through reducing cron mails

20 Aug 2014

New Karma:
==========
 none

Finances:
==========
RAM for VMWare host:        $690.79


Operations Action Items:
========================

Short Term Priorities:
======================

* Code signing
  Mark Thomas has successfully concluded his testing of Symantec Application
  signing service. Subsequent to that, he's identified a workflow that should
  work for our many projects. Conversations with Symantec on pricing are
  ongoing.

* Response timeframe targets:
  It was agreed to set up three distinct timeframes for responding to incidents;
    1) For critical services, incidents should be responded to within 4 hours
    2) For core services, incidents should be responded to within 6 hours
    3) For standard services, incidents should be responded to within 12 hours.

  The response need not be a resolution of the issue, but needs to include one
  or more of the following steps;
    1) Acknowledging the incident through internal channels (See PagerDuty
       et al)
    2) Communication of the incident to the involved/affected people in
       accordance with the new communications plan laid out by the VP.
    3) Delegation of the issue to a member of infrastructure whenever possible
    4) Tracking of the incident (method depends on the duration and gravity of
      the incident)
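
  The response targets above can be expressed as a small lookup. The sketch
  below is illustrative only and is not taken from Infra's tooling: the tier
  names mirror the report, while the function names and the idea of comparing
  timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Response targets in hours, as listed in the report.
RESPONSE_TARGET_HOURS = {
    "critical": 4,
    "core": 6,
    "standard": 12,
}


def response_deadline(tier, reported_at):
    """Latest time by which an incident of the given tier should get a response."""
    return reported_at + timedelta(hours=RESPONSE_TARGET_HOURS[tier])


def responded_in_time(tier, reported_at, responded_at):
    """True when the first response arrived on or before the deadline."""
    return responded_at <= response_deadline(tier, reported_at)


if __name__ == "__main__":
    reported = datetime(2014, 8, 1, 9, 0)
    first_response = datetime(2014, 8, 1, 11, 30)
    print(responded_in_time("critical", reported, first_response))
```

  Note that a response here means any of the acknowledgement, communication,
  delegation or tracking steps listed above, not necessarily a fix.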


* On-call rotation:
  At the f2f meeting in Cambridge, it was decided to introduce an on-call
  rotation between contractors. Each week, a contractor will be assigned as
  being on-call, and will be responsible for either resolving, delegating or
  communicating about outages, account-, mailinglist- and tlp-creations, as well
  as planned changes, and security issues. To the extent that this is possible
  (not counting sleep), incidents must be responded to within the new target
  timeframes, as explained in the previous paragraph.

  In the time since going live with both an on-call rotation and tracking
  response time in the response timeframe, the responses have been
  dramatically faster than the service level expectations. As we build up
  staff numbers and spread geographically a bit, the expectations may change.

* Improved response for and analysis of Java services at the ASF: At the
  previously mentioned f2f meeting, contractors were introduced to a detailed
  course of analysing and reporting incidents with Java applications run by
  infrastructure. We expect this new information to be extremely valuable in
  reaching and maintaining the target uptime for Java services. The staffers
  would like to extend a very big thank-you to Rainer Jung for his services in
  this matter.


Long Range Priorities:
======================

* Monitoring
  ** Uptime monitoring and responsibility:
     In addition to the previous board report, a new service level agreement was
     made between infrastructure staffers, increasing the targets for uptime on
     critical and core level services, as described below in the statistics
     paragraph. Ensuring that services meet the new targets has been made one
     of the cornerstones of infrastructure's work. The monitoring of public
     facing services has been outsourced to a third party (free of charge), and
     we will be focusing on having Circonus produce metrics for our inwards
     facing services/devices, such as LDAP, PubSubs, SNMP etc.

  ** Unified logging:
     Experiments with unified logging are proceeding as planned, with more and
     more hosts being coupled into the new logging system. A filtering
     mechanism for the lucene-based backend has been created, allowing anyone
     to use the logging service based on their LDAP credentials. As such,
     anyone with access to a specific host (as defined in LDAP) will be able to
     pull logs from the unified logging system. We are confident that this will
     make debugging and analysis easier, to the point that we are disabling
     older alerting/information systems and using the logging system to fetch
     information that would previously have been sent via email to root@.

* More virtualisation; Better use of what resources we have:
  It was decided to move towards more use of virtualisation for many critical
  and core services, including our main web sites and wikis. This will allow
  us to better respond to incidents and resolve them without affecting other
  services. Furthermore, it is our belief that we can free up resources by
  switching to a virtualised environment, thereby possibly getting more
  space for the crammed-up project/service VMs.

* Automation
  ** Cloud-based dynamic build slaves have been in progress for a bit. Much of
     the work around this has been driven by Dan Norris. Building on a framework
     of repeatable builds he and Gavin McDonald have been successfully spinning
     up on-demand build slaves with our RackSpace account. This also relates to
     our goals around configuration management, and configuration of the machine
     is in Puppet. Expect to see this service go into production in the next
     week or so.

  ** Puppet - the scope of puppet deployment continues to edge forward. Gavin
     McDonald attended training just before the Infra F2F meeting.

* Technical Debt and Resiliency
  Work around uptime monitoring and actually being able to better understand
  where the pain points are, coupled with some of the knowledge we gained at
  the Infra F2F, has allowed us to focus on long term adjustments rather than
  hasty short term restoration of service. You should see this reflected in
  the uptime statistics.

General Activity:
=================
- A face-to-face meeting between infrastructure members was held in Cambridge,
  UK.
- 27 new committer accounts created, 8 new mailing lists (TBC)
- 3 projects were promoted to TLP
- 193 JIRA tickets resolved (since last report)
- A new status site was launched by Infra at status.apache.org
- PagerDuty has donated a gratis account for up to 10 users to ASF Infra

Uptime Statistics:
==================
Due to a new SLA between contractors, the targets for critical and core services
have been updated to reflect the new criteria (99.50% for critical and 99.00%
for core respectively). This represents an overall increase of 0.57% uptime
across the board. These figures span the previous reporting cycle as well as the
present reporting cycle (weeks 29-33)

Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:      99.50%    99.94%        Yes
Core services:          99.00%    99.81%        Yes
Standard services:      95.00%    96.83%        Yes
---------------------------------------------------------
Overall:                98.00%    98.99%        Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime

Detailed Contractor Reporting
=============================

* Daniel Gruno

  - Resolved 51 JIRA tickets
  - Worked on a new status site for public ASF services
  - Worked on the ELK stack, set up on phanes/chaos.
  - On-call duties
  - Worked with Gavin and Tony on OpenSSL CVEs and general VM upgrades
  - Fixed issues with svn2gitupdate not working
  - Worked around a GitHub API change that had invalidated our integration
    measures
  - Miscellaneous upgrades to ASFBot
  - Continued work with uptime monitoring and reporting
  - Worked with Fundraising to produce statistics about the ASF
  - 8 days of vacation.


* Tony Stevenson

  - Resolved 74 issues
  - Took part in the bugbash
  - On-Call rotation
  - FreeBSD/Ubuntu SSL CVEs.
  - Started work to disable swap across all VMs
  - Investigations into BigIP/F5 etc
  - Fixed issues with VPN appliance
  - Several disk replacements
  - Run down several repeat cron error messages
  - Some further conversations with others about puppet
  - Setup trial of lastpass with a view to possibly replacing our GPG files.
  - Instigated the trial of hipchat, with a view to seeing if we could deprecate
    IRC.

* Tony Stevenson - Comments

  Hipchat:

  For a long time I have been thinking about trying to find a better way to
  engage with some of our users. Also, I was hoping to find a way that we could
  get a better feed of information that was more relevant and pertinent.

  We have hooked it up to JIRA, Github, Pagerduty, and PingMyBox. These all
  provide near realtime information that we can act on.

  The more modern service will perhaps be seen as a move on from some of our
  older roots, which might appeal to others. With the move we have also seen the
  SNR improve significantly, enabling better communication across the team.

  The alerting with Hipchat allows people to be notified of communications that
  involve them, via push messages to a phone/tablet etc. Also once you return
  online from an offline state you see all the history. The history is
  searchable.

  You can join us, here, https://www.hipchat.com/gw4Cfp7JY

  Private channels can also be created, for those who need a channel with
  controlled access. Think #asfmembers etc.

* Dan Norris

  - Built machine image automation using Packer
  - Packaged (using FPM) many of the unpackaged build tools (Provides
    repeatable, known installation; allows us to query for status and version)
  - Deployed a DEB repository in RAX CloudFiles for packages
  - Puppetized the build slave configuration
  - Documented the process of building a machine image, uploading it to RAX
  - Using jclouds plugin for Jenkins, successfully provisioned dynamic build
    slaves.

16 Jul 2014

New Karma:
==========

Dan Norris (dnorris)


Finances:
==========
* 64GB RAM for Arcas:       $777.16


Operations Action Items:
========================


Short Term Priorities:
======================

* Signed binaries
  Mark Thomas has made significant progress in his efforts with Symantec around
  using their Binary Signing as a Service product. I have high hopes that we are
  near a proposed solution.

* builds.a.o
  Much work has been done around builds.a.o; and it's largely stabilized. The
  past month has yielded 99.92% uptime. That's a far cry from the routine
  outages that were happening on average once a day.

Long Range Priorities:
======================

* Monitoring
  ** Uptime monitoring and reporting:
     As an extension to defining core services and service uptime targets,
     Daniel has begun compiling weekly reports of uptime for most of the
     publicly facing services. These reports will in turn be compressed into
     monthly reports for the board as well as a yearly report detailing the
     overall uptime reality vs our set targets. Eventually, these reports will
     also feature inward facing services.

  ** Unified logging:
     Discussion and exploration has begun on unifying logging on all VMs and
     machines. The logging will be tied to puppet and allow for easy access to
     each host's logs from a centralized logging database, as well as allow for
     cross-referencing data. Initial exploration into using LogStash with
     ElasticSearch and Kibana has begun, and is expected to produce findings
     for use in the next board report.

* Automation
  Tony has expended effort and time in deploying a more up-to-date, platform
  agnostic base for puppet. Giridharan Kesavan and Gavin have been
  experimenting with using Ansible for build slave automation. Dan Norris
  has begun work on automating VM/cloud provisioning.

* Technical debt
  Gavin began addressing cruft in many of our automated jobs; this will be a
  long term effort, but that work is underway and already yielding benefits

  In some ways, we are just beginning to collect information to let us know
  where we stand, and exactly how much debt we have accrued. We have started
  comparing the uptime reports with our first pass at service level
  expectations.

* Resiliency
  Our efforts around resiliency are still nascent. We have begun to address a
  few issues caused by resource constraints, though this is a very minor attempt
  to provide true resilience. As other efforts in our long term priorities take
  shape, I expect that we'll begin to see this accelerate.

General Activity:
=================

* Dealt with yet another batch of OpenSSL CVEs affecting all hosts.
* Upgraded Arcas (JIRA host) with 64GB RAM to deal with slow response times.
  This has greatly reduced the response time from Jira. See screenshot detailing
  that change: http://people.apache.org/~ke4qqq/ss.png
* Welcomed a new contractor, Dan Norris, to the fold.
* Face-to-face meeting in Cambridge between infrastructure people
* Created 23 new committer accounts, 4 new mailing lists


Uptime Statistics:
==================
These figures currently span weeks 27 and 28 of this year, and only cover public
facing services.

Type:                   Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:       99.00%    99.98%        Yes
Core services:           98.00%    99.84%        Yes
Standard services:       95.00%    92.71%        No[1]
---------------------------------------------------------
Overall:                 97.43%    97.80%        Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime

[1] The target for standard services was not met due to our Sonar instance
    being unstable at the moment and only having around 50% uptime. We are
    investigating the issue.
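
To give a feel for what these percentages mean in practice, the sketch below
converts an uptime percentage over the two-week window into minutes of
downtime and re-checks each target. It is only an illustration: the report
does not say how its figures were computed, the helper names are
hypothetical, and the per-tier figures are averages across services rather
than the downtime of any single service.

```python
# Hypothetical helpers; the report does not describe its own calculation.
MINUTES_PER_WEEK = 7 * 24 * 60


def downtime_minutes(uptime_percent, weeks=2):
    """Minutes of downtime implied by an uptime percentage over `weeks` weeks."""
    total_minutes = weeks * MINUTES_PER_WEEK
    return total_minutes * (100.0 - uptime_percent) / 100.0


def target_met(target_percent, actual_percent):
    """A target is met when measured uptime is at least the target."""
    return actual_percent >= target_percent


# Figures copied from the table above: (target %, measured %).
report = {
    "Critical services": (99.00, 99.98),
    "Core services": (98.00, 99.84),
    "Standard services": (95.00, 92.71),
}

for name, (target, actual) in report.items():
    status = "Yes" if target_met(target, actual) else "No"
    minutes = downtime_minutes(actual)
    print(f"{name}: about {minutes:.1f} min down, target met: {status}")
```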


Contractor detail:
==================


* Gavin McDonald

Short term Jobs worked on this week:
====================================

  Jira.
  ------

Jira tickets worked on [12] See: jql query 'project = INFRA AND updatedDate >=
"2014/06/16" AND updatedDate <= "2014/06/22" AND assignee was ipv6guru ORDER
BY updated DESC'

Jira Tickets Closed [10] See: jql query 'project = INFRA AND resolutiondate >=
"2014/06/16" AND resolutiondate <= "2014/06/22" AND assignee was ipv6guru
ORDER BY updated DESC'

My Open Jira Tickets [34] See: jql query 'project = INFRA AND status != Closed
AND assignee was ipv6guru ORDER BY updated DESC'
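
The date-bounded queries above all follow one pattern, so they can be
generated for any reporting week. The sketch below builds the "tickets
closed" query string; the function name is hypothetical, and it only
assembles text in the same shape as the queries quoted in this report (it
does not talk to Jira).

```python
from datetime import date


def closed_in_window_jql(project, assignee, start, end):
    """Build a JQL string for tickets resolved between two dates (inclusive)."""
    fmt = "%Y/%m/%d"
    return (
        f"project = {project} "
        f'AND resolutiondate >= "{start.strftime(fmt)}" '
        f'AND resolutiondate <= "{end.strftime(fmt)}" '
        f"AND assignee was {assignee} "
        "ORDER BY updated DESC"
    )


# Reproduces the query quoted above for the week of 16-22 June 2014.
print(closed_in_window_jql("INFRA", "ipv6guru", date(2014, 6, 16), date(2014, 6, 22)))
```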

Commits made Infra repo: 22

June 16th saw planned downtime at OSUOSL. The downtime window was 2 hours
between 11am UTC and 1pm UTC. Daniel Gruno and I covered this outage window
and also at least 2 hours before and after the planned window. Actual
downtime we saw was 2 minutes at 11:55am.

Ongoing answering of queries on the infra@ and build@ mailing lists, including
quick resolutions to issues raised. The same goes for IRC - Channels open at
time of writing are:
  #sling #asfboard #jclouds #asfmembers #asftac @#abdera #avro #osuosl
  #+#buildbot #asftest @#asfinfra

Worked on various buildslaves of both Buildbot and Jenkins, updating,
upgrading, patching for SSL etc. Worked on upgrading SSL for several other
VMs, at the same time taking the time and opportunity to
update/upgrade/dist-upgrade and reboot.

Ongoing Medium Term Jobs:
=========================

  Dell Warranty Renewals.
 ------------------------------------

   Involves liaising with various Dell reps via email. Service Tags have been
   99% brought up to date and documented in the service-tags.txt file. Make
   decisions on warranty renewals based on age and whether it is in our plan to
   renew the machine within the next 9 months. Get quotes for and give the go
   ahead to Dell for those we intend to renew. The current email noise from Dell
   regarding these is quite high so this is a task I intend to complete over the
   next few weeks - to either renew, or decline and stop renewal emails.

  Root Cron Job Emails.
 --------------------------------

  Involves sifting through root@ Cron emails from various machines and VMs.
  Determine the current important ones that can be assessed and fixed to
  completion. Previously, this was just 'done' and perhaps followed up with an
  email reply to a cron job in question. For better visibility and reporting, I
  have now started creating Jira tickets for these tasks, and have also given
  these tickets the 'Cron' label. I expect to make steady progress and have the
  cron mails at least halved over the next 3 months.

  See JQL Query: 'project = Infrastructure and labels = Cron'

Confluence Wiki.
-----------------------

Confluence needs an upgrade. Test instance is in progress. I hope to have this
done in the next couple of weeks.

Ongoing Longer Term Jobs:
=========================

  Jenkins/Buildbot
  ----------------------

Some time has been spent improving the stability of the Jenkins server and its
slaves. Thanks mainly to Andrew Bayer, the server has recently improved
dramatically. The slaves have seen improvement in stability and uptime too,
including the 2 Windows machines. I have spent a fair bit of time recently on
these. I need to create new FreeBSD and Solaris slaves for Jenkins. The former
I think we can achieve in the cloud, whilst the latter I don't think is
supported at Rackspace; I am investigating. We might need to create our own VM
image for it. At the time of writing, 34 builds are in the Jenkins queue,
mostly attributed to these two missing slave OS flavours and also Hadoop jobs.

Buildbot stability is just about back to normal after I rebuilt the Master from
scratch on a new OS, FreeBSD 10 (prepped by Tony). The forced upgrade of the
Buildbot Master version itself also caused some instability for a while due to
configuration upgrades required. This affected just about all projects using
Buildbot and the CMS. I note that the Subversion project has indicated that a
mail should have been sent to the Subversion PMC about the downtime suffered
by the Subversion project as a result of the code changes required by the
forced upgrade. Following this advice, I'd have had to email another 30+ PMCs
also telling them the same thing. I find that my generic email to the infra
list should have been enough information for all parties concerned.

  Cloud for Builds.
  ----------------------

  Rackspace - A test machine has been created. Jenkins has yet to make use of
  this, however, and I'm in the process of working out the best way to
  integrate with our systems - do we use LDAP, Puppet etc with it or create a
  custom image we can replicate. I'll also be starting work soon on an
  on-demand Buildbot test instance.

  Microsoft Azure - A test machine with Windows Server 2012 is up and running
  and I have access. I am in the process of making changes to this image to
  make a baseline so that the Azure team can replicate several more once I have
  it right. Once done for Jenkins I'll do the same for Buildbot, and make sure
  to leave 2 or 3 instances available for general project use, which I'll
  advertise as available once ready.

  Puppet - Have completed the online pre-training Puppet course as advised by
  David, using a Vagrant instance via VirtualBox. I continue to invest a couple
  of hours a week in looking through the Puppet Labs online documentation. I
  continue to investigate the best methods of integrating the Jenkins and
  Buildbot slaves with Puppet, though I'm really in a waiting pattern for our
  puppet master to be upgraded to v3.


* Tony Stevenson

Took two weeks of vacation

Spent a considerable amount of time trying to make a new Puppet3 master on a
FreeBSD box; this however did not pan out. There were far too many little
changes from a standard deployment needed and we were still having SSL issues
with puppetdb.

A new Ubuntu VM has been built as the new puppet master and is now about done.
One more test to run tomorrow.

Spent a little bit of time on-boarding jake into root@ activities (a/c
creation etc).

Issues with Erebus VMware host. Needed a reinstall of the vsphere agent and
reconnecting to the management console.

New infra-puppet GitHub repo


* Daniel Gruno

Work log for Week 27:
=====================
  - Create mailing lists for new and existing podlings
  - Access to metis+eris for jake.
  - Set up svnpubsub/cms for new podlings
  - Evaluate ELK stack (ElasticSearch, Logstash + Kibana)
  - Work on factoid features for IRC
  - Ordered 64GB RAM for Arcas (8x8GB, replacing 7x4GB)
  - Upgraded hardware on Arcas (JIRA host)
  - Set up dist areas (some requests proved invalid)
  - Investigate and fix database issues with ASF Blogs (twice)
  - Monitor and compile uptime records for core services over the
    last week.

   (The majority of my time was spent evaluating and tailoring the ELK
    stack, as well as the math fun with semi-automating uptime reports.)

Work log for Weeks 25 and 26 (sans JIRA tickets):
=================================================
  - Updated ASFBot with some minor bugfixes and feature additions
  - Worked on Git mirroring between ASF and GitHub (aka svn2gitupdate)
  - Assisted in applying web server updates for projects
  - Design discussions with Jan and Gavin about Circonus monitoring (still
    ongoing, awaiting results of initial test)
  - Discussed GitHub PR usage with the Usergrid project
  - Investigated and solved an issue with JIRA not responding
  - Worked on updating OpenSSL on all affected machines (CVE-2014-0224 et al,
    ~95% done, should be done by the end of this week (ceteris paribus))
  - Worked on an issue with nyx-ssl and puppet (still unresolved)
  - Worked with Gavin to monitor and respond to OSUOSL network upgrades.
    Resolved.
  - Helped projects tweak settings for IRC relaying of commits/JIRAs
  - Worked on anti-spam measures for modules.apache.org (still under infra's
    umbrella)
  - Worked with Dave to resolve the blogs 404 issue. Resolved in week 27 by
    Brett Porter.

18 Jun 2014

New Karma:
==========
* Jake Farrell (jfarrell) was added to root
* Andrew Bayer (abayer) was added to infrastructure-interest

Finances:
==========

Discovered a past due bill from Dell based on Justin Erenkrantz getting
collection phone calls. ~$1300

Placed order for 2 servers, totalling nearly $17,000

With help from EA, arranged for travel for a F2F as well as
travel for contractor training; thus far that has cost ~$9193


Operations Action Items:
========================

None at the moment

Short Term Priorities:
======================

* OSU Hardware failures - Work continues on hardware failures at OSUOSL;
replacement hardware has been ordered and shipped, and work continues on
getting it swapped in while minimizing outages.

* Outage remediation - Much work continues from the action items drawn from
the post-mortems.

* Builds.a.o - We've received a good deal of help from the Jenkins community in
finding and dealing with issues.


Long Range Priorities:
======================

* Monitoring - Circonus has now replaced Nagios as our monitoring system with
lots of help from Jan Iversen. While we still have a very long way to go, the
system is already proving useful, having alerted us to a number of issues.

* Automation - Slow progress continues on rolling out configuration management
in efforts to make our infrastructure better documented and more easily
reconstructed.

* Technical Debt - We have begun publishing/discussing early drafts of
documents around expected service levels as well as a communications plan,
which are very early steps in beginning to prioritize work around our
technical debt. See:
https://svn.apache.org/repos/infra/infrastructure/trunk/docs/services/LEVELS.txt
and
https://svn.apache.org/repos/infra/infrastructure/trunk/docs/vp/comms_plan.txt

* Resiliency - Discussions around resiliency have started, but are still
nascent.

General Activity:
=================

In the month of May Infra had 194 tickets opened, and closed 158 tickets in
Jira. For the month of May, Jake Farrell closed the largest number of tickets
with 56.

highlights include:

* Dealt with emerging DMARC issue and blogged about it at

 https://blogs.apache.org/infra/entry/dmarc_filtering_on_lists_that

* Rewrote our qmail/ezmlm runbook documentation to bring it up to date.

* Raised potential UCB issues with our current organizational usage of
committers@. See INFRA-7594 for background.

* Dealt with a crop of openssl-related security advisories.

* Dealt with two as-of-yet unpublished security vulnerabilities.

* Published a blog entry on the mail outage postmortem:

 https://blogs.apache.org/infra/entry/mail_outage_post_mortem

* Confluence was patched after advance warning from Atlassian before they went
public with a security vulnerability.

* During the course of compiling an inventory for Virtual and adding in our
cost to purchase those units, we discovered that 5 machines were not
in our inventory[1]. Three of those machines were either unutilized or
underutilized. This will likely reduce some of our expected hardware spend as
they were relatively recent purchases.
[1] http://apache.org/dev/machines.html

* Enabled emails sent from apache.org committer addresses (or any addresses in
LDAP) to bypass moderation across all apache.org mailing lists.  No changes to
SPF records for the foreseeable future.

21 May 2014

New Karma:
==========

Andrew Bayer (abayer) was granted jenkins admin karma.

Finances:
==========

Infra spent or authorized to spend almost $3300 thus far in the new fiscal
year; all related to replacement hardware or service for hardware.

Operations Action Items:
========================

None

Short Term Priorities:
======================

* OSU Hardware failures - We have a number of hosts that have degraded or dead
hardware in our Oregon colo. This is a mixture of machines that are in and out
of warranty and involves machines that host both core services and less
important ones. Status is being tracked at:
https://pad.apache.org/p/osustatus

* Outage recovery - Coming out of our outages we have a substantial number of
remediation items. In some cases the service has been restored but is not back
to pre-outage levels of operation.

* Builds.a.o - Stabilization of Jenkins is a primary concern. Much work has
happened from volunteers and contractors alike (see comments below in general
activity as to improvement). We are still suffering from service failures
every couple of days at this point.

Long Range Priorities:
======================

* Monitoring - Our monitoring still lacks the level of insight to provide
operationally significant information. Work continues on this front. Our new
monitoring system (Circonus) should come online in the next few weeks, but
much remains to be instrumented for it to be truly useful.

* Automation - Slow progress continues on rolling out configuration management
in efforts to make our infrastructure better documented and more easily
reconstructed.

* Technical Debt - Work is ongoing to prioritize the services infrastructure
provides and to set expectations and service levels around those services.

* Resiliency - I wish that I could say that much work has occurred here, but
most of the month has been focused on outage recovery. The beginnings of that
work have taken place in working to restore a stable platform
(see the note around hardware at OSU).

General Activity:
=================

Infrastructure suffered three major outages in this reporting period. The
first involved the Buildbot host and a disk failure. CMS and Buildbot project
builds were down for several days while the machine was rebuilt. The second
outage was the blogs.a.o service. You can see the details and remediation
steps that are being taken here: https://s.apache.org/blogspostmortem The
third was a 4 day outage of our mail services. You can see the results of the
post-mortem here: http://s.apache.org/mailoutagepostmortem As of this writing
there is still a significant backlog of email being processed.  At the current
rate, we expect the backlog to be cleared by May 16th.

The Buildbot host aegis lost a disk also and the machine was rebuilt over a
few days, changing from Ubuntu to FreeBSD 10. The CMS and project builds were
down whilst this happened. At the same time the Buildbot Master version was
upgraded to the latest release which caused some tweaks to the code and
project config files.

Infra has noted an increased level of concern regarding the CI systems and in
particular the Jenkins side of builds. Some projects are concerned about the
level of support that Infra gives these systems. A combination of factors over
the last months has seen a decline in support: other higher priority services
taking up time, a decline in volunteer time, an increase in projects using the
systems and in parallel an increase in build complexity, all making for a
decline in available resources because slave increases have not happened in a
scaled manner to match. All this is being resolved as we speak and improvements
are being made, and there are many plans for the short, medium and long
term. The work done already is showing progress. For example, on 2 May the
average load time for builds.a.o was 72.69 seconds and the average number of
builds in the queue was 65. On 13 May the average load time is down to 1.86
seconds and the average number of jobs in the queue is less than 4. Much work
remains to be done. For data see: http://people.apache.org/~hboutemy/builds/
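
As a quick piece of arithmetic on the figures quoted above, the drop in
average load time from 72.69 to 1.86 seconds is a reduction of roughly 97%.
The sketch below shows the calculation; the helper name is hypothetical and
the inputs are simply the two numbers from this report.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Average builds.a.o load times quoted above, in seconds.
load_on_2_may = 72.69
load_on_13_may = 1.86

print(f"Load time reduced by {percent_reduction(load_on_2_may, load_on_13_may):.1f}%")
```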

Plans to address existing issues:
==================================

As has been noted infra ran into a number of problems bringing a number of
key services back into use. There are a number of planned steps that are
either remedial or work around building a more robust foundation. All of
these tie back into the long term priorities you see above.

Below are things I have requested of one or more contractors:

* SLAs - We're dividing up services into various criteria. Failures happen,
but our level and rapidity of response as well as the degree to which we
engineer for failure must be measured against how critical the service
is. The current plan is to submit the finished work to the President for
review and discussion with an audience he deems appropriate.

* Prioritization of hardware replacement - New hardware doesn't guarantee
against an outage. However, continuing to test the mean time between
failures for underlying hardware tends to increase risk on average. Along
with this prioritization, I've asked that each of the services being
replaced be done by a person who isn't the 'primary' for that service.
That list is not yet complete, but is being worked on.

* Documentation - Currently the quality varies from service to service.
Some of our documentation is clearly out of date, some is decent. My
experience is that most documentation suffers from bitrot in any
organization. However, I've requested multiple folks to bring our docs
for various services up to a usable state. I've also requested that
folks other than those who produced or will be producing the documentation
review it and use it to ensure it is accurate and adequate.

* Backups - In general, our backups, where they've been happening, have been
sufficient. We've already had work around documenting restoration from backup
get committed to our docs in SVN. Additionally, short term tasks have been
handed out about establishing, verifying, or restoring backups as well as
checking that against the services and documentation. There are also tasks in
place to work on speeding up our restore timelines.

* Automation - We possess a lot of operational automation (scripts and other
tools that allow us to create or subscribe to lists, create users, etc.) We
have bits and pieces of infrastructure automation - but it's not widespread.
In the three outages we've experienced catastrophic failure of the hardware
resulting in the need to rebuild the service from scratch. Virtually all of
the moving pieces involved manual processes from OS installation to service
configuration. That dramatically increased our time to recovery; as well as
being prone to user error. To that end; I've requested the following:

 - Consolidate the number of platforms we support for core services. We
   currently have Solaris, Ubuntu, two major versions of FreeBSD. I've
   asked for a single version of Ubuntu and a single version of FreeBSD
   to be adopted across all of our non-build and non-PMC infrastructure.

 - Deploy an automated OS installation tool -  During the mail outage we
   had to get smarthands in the datacenter to burn an OS install DVD and
   deploy a fresh operating system twice. This meant that a ten minute
   task turned into more than an hour in each case. I've set the criteria
   that we be able to deploy our installs over the network and control
   booting and other functions via an out-of-band management tool such
   as IPMI. We must also be able to host our own package repositories.

 - Configuration management -  We currently have puppet deployed but it
   isn't widely used within our infrastructure. Puppet permits you to
   declare state in its domain-specific language that controls how a
   machine is configured and what software is installed, as well as
   collecting data on the machine itself. Puppet also enforces state, and
   this enforcement is, quite frankly, better than documentation. Even if a
   machine is completely destroyed, by having done the work in puppet, we
   know the exact state of the machine and can deploy that exact
   configuration back to a new machine in a matter of moments. To that end
   I have planned the following items:

       - Training - Most of the infra contractors have not used puppet in
         anger. Beginning in the next few weeks; they'll make use of some
         gratis online training from Puppetlabs with plans for attending
         a hands-on class within 6 weeks. (budget-willing)

       - Mandatory use for new services. I've asked that all new work and
         services being stood up must be done using puppet.

       - Service restoration. For core services that have failed recently,
         we've either updated documentation or have tasks to do so. I've
         requested tasks for translating that into puppet manifests to
         dramatically reduce our mean time to recovery. For services that
         will move to new hardware, if that involves the recreation of the
         service I've asked that be done via config management as well.

       - Base OS deployment. The base OS deployment at the ASF is very
         well documented. In the case of FreeBSD it's ~26 individual
         manual steps that must be executed every time. In conjunction
         with work on an automated OS install; I've asked that all of the
         base OS deployment and configuration be automated via puppet.

 - Monitoring - Put simply, our monitoring does not currently provide enough
   insight. For example, we did not know about the failing hardware underlying
   our mail service. According to our monitoring, things were fine. This isn't
   to say that knowing about it would have prevented the outage, but I would
   at least like the advantage of knowing about it in a timely manner. As
   mentioned elsewhere, when smarthands were working on our equipment they
   noted that many of our servers were complaining about hardware problems.
   Monitoring is largely grunt work; knowing what to monitor for each service
   is something that the contractors can rattle off. Actually setting up
   monitoring is a large time sink. We currently have a volunteer doing a good
   chunk of work, and my plan is to temporarily (3-6 month timeframe)
   supplement that with an outside contractor who is already familiar with
   our monitoring system and puppet.
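
The configuration management item above rests on the idea of declaring a
desired state and repeatedly enforcing it. The sketch below illustrates that
idea in plain Python; it is not Puppet and not how Puppet is implemented, and
the resource names and helper functions are hypothetical.

```python
def plan_changes(desired, actual):
    """List (resource, current value, wanted value) for every mismatch."""
    changes = []
    for resource, wanted in sorted(desired.items()):
        current = actual.get(resource)
        if current != wanted:
            changes.append((resource, current, wanted))
    return changes


def converge(desired, actual):
    """Bring `actual` in line with `desired`; a second run changes nothing."""
    for resource, _current, wanted in plan_changes(desired, actual):
        actual[resource] = wanted
    return actual


# Hypothetical declared state for a host.
desired_state = {"package:ntp": "installed", "service:ntp": "running"}

# A freshly rebuilt machine starts with none of the declared resources.
rebuilt_host = {}
converge(desired_state, rebuilt_host)

# After convergence there is nothing left to change.
print(plan_changes(desired_state, rebuilt_host))
```

Because the declared state is the single source of truth, the same
description that documents a machine can also rebuild it, which is the
property the report relies on for reducing time to recovery.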

None of the above prevents failure. It might give us an edge in detecting that
a failure is about to occur, or permit us to drastically reduce our time to
recovery; but it does not actually keep bad things from happening. The longer
term piece of this puzzle is to begin engineering our most important services
to be more redundant or more fault tolerant. Most of our services are not
set up this way. Our first target is going to be the mail service; we are doing
this for two reasons. First our experience with the mail backlog and the hoops
we had to clear to empty that backlog suggest that we aren't very far from the
limits of our current architecture. Second, as you've seen in the past few
weeks, a mail outage is absolutely crippling for the Foundation.

That said, please understand that the problems we have are not going to be
solved in the short term. By the end of the quarter I hope to be able to
report that we have a good start on these initiatives, but this is a long term
effort. Unless luck intervenes it's almost inevitable that we'll suffer
another outage this year.  Hopefully we'll be in a better place to respond to
those outages as we go forward.

16 Apr 2014

New Karma:
==========


Finances:
==========
No purchases/renewals for the month since last report.

Operations Action Items:
========================


Short Term Priorities:
======================

* Look into mac build slaves.

* Converge on git.apache.org migration to eris. (Step 1 is merge git ->
 git-wip on tyr) (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken
 place.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.


Long Range Priorities:
======================

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.


General Activity:
=================

* New 3-year wildcard SSL cert purchased and installed for *.openoffice

* Thrift migrated to CMS with the aim of providing better support for similar
 sites.  Blog entry here:

     http://blogs.apache.org/infra/entry/scaling_down_the_cms_to

* The number of Confluence administrators was significantly reduced, to keep
 the list as small as possible.  Historically this permission level was
 required to operate and manage the autoexport plugin which has since been
 deprecated. See https://issues.apache.org/jira/browse/INFRA-7487

* An inter-project communication site was requested by the community at
 ApacheCon and is being looked into by infra. This will essentially be an
 aggregator of project development wishes/requests, and will most likely
 reside on wishlist.a.o.

* As a way of lowering the bar for and securing security reports, infra is
 looking into creating a system which, based on LDAP, accepts and encrypts
 security reports for projects. The exact setup and nature of this system is
 being discussed, primarily with members of the subversion PMC.

* Heartbleed happened: see
 https://blogs.apache.org/infra/entry/heartbleed_fallout_for_apache

* Two members of the infrastructure team attended ApacheCon NA 2014 and had a
 few community sessions with committers to hear their concerns and attempt to
 address them.  Also met with CloudStack members to discuss their widely
 publicized proposal for additional infrastructure needs surrounding project
 builds.

19 Mar 2014

New Karma:
==========
mdrob added to infra-interest.


Finances:
==========


Operations Action Items:
========================


Short Term Priorities:
======================

* Look into mac build slaves.


* Converge on git.apache.org migration to eris. (Step 1 is merge git -> git-wip on tyr)
 (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.


Long Range Priorities:
======================

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.


General Activity:
=================

* The new GitHub features have been well received, with 28 projects already
 onboard with the new features in February alone. As a result, the number of
 GitHub-related messages on the public ASF mailing lists has risen from 304
 in January to 3,616 in February, with expectations to exceed 5,000 in
 March. There has been a discussion on whether to transition from opt-in to
 opt-out on these features, but for the time being, it remains opt-in.

* Instituted a weekly cron to inform private@cordova about the current list of
 committers not on the PMC, which should be the empty set.  Currently about a
 third of the PMC is impacted with no indication that this will ever be
 addressed by the chair; the requisite notices have already been sent to
 board@.

* Discussed the current state of affairs with our build farms as they relate to
 TrafficServer's needs.  We intend to address this with increased funding in
 next year's budget.

* Received a report about several compromised webpages hosted by VMs
 associated with OFBiz.  In the process of working with the PMC to correct
 this situation.
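
The weekly check described in the Cordova item above amounts to a set
difference between the committer roster and the PMC roster. The sketch below
shows that comparison only; the function name and the sample ids are
hypothetical, and the actual cron job and its data sources are not described
in this report.

```python
def committers_not_on_pmc(committers, pmc_members):
    """Return committer ids that are missing from the PMC roster, sorted."""
    return sorted(set(committers) - set(pmc_members))


# Hypothetical ids for illustration only.
committers = ["alice", "bob", "carol"]
pmc_members = ["alice", "carol"]

missing = committers_not_on_pmc(committers, pmc_members)
if missing:
    print("Committers not on the PMC: " + ", ".join(missing))
else:
    print("All committers are on the PMC.")
```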

19 Feb 2014

New Karma:
==========


Finances:
==========


Board Action Items:
===================


Short Term Priorities:
======================

* Look into mac build slaves.


* Converge on git.apache.org migration to eris. (Step 1 is merge git ->
 git-wip on tyr) (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken
 place.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.  (Support case closed, nothing useful came from it
 other than check the logs.)

* Port tlp creation scripts over to new json-based design on whimsy.

* Ensure all contractors are participating in on-call situations, minimally by
 requiring cell-phone notification (via SMS, twitter, etc) for all circonus
 alarms.

* Explore better integration with GitHub that allows us to retain the same
 information on the mailing list, so that vital discussions are recorded as
 having taken place in the right places (if it didn't happen on the ML...).

Long Range Priorities:
======================

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.

* Institute egress filtering on all mission-critical service hosts.


General Activity:
=================

* Migrated dist.apache.org from backups of thor to eris.  Unfortunately a
 dozen commits were naturally lost in the process.  Thanks to TRACI.NET for
 providing additional bandwidth for this purpose.

* Jira: Jira is now running on Apache Tomcat 8.0.0 (rather than 7.0.x). While
 running on 8.0.x is unsupported by Atlassian, this is providing valuable
 feedback to the Tomcat community. To mitigate the risk of running an
 unsupported configuration, Jira is being monitored more closely than usual for
 any problems and there is a plan in place to roll back to 7.0.x if necessary.

* At the behest of committers, we have started working on a stronger
 implementation of GitHub services, including 'vanity plates' for all Apache
 committers on GitHub.  A method of interacting with GitHub Pull Requests and
 comments has been completed, that both interacts with the GitHub interface
 and retains all messages on the local mailing lists and JIRA instances for
 record keeping. At the time of writing, we have 367 committers on the Apache
 team on GitHub. We have made a blog entry about this at
 http://s.apache.org/asfgithub which seems to have reached many projects
 already.  Furthermore, the Incubator has been involved in the development of
 this, and are thus also aware of its existence and use cases.

* The new SSL wildcard was obtained from Thawte earlier this month, and will
 be rolled out to services very soon. Thanks to jimjag for getting the business
 end of the deal done so we could actually get the cert in before the incumbent
 expires.

* All remaining SVN repos have now been upgraded to 1.8.

* Resurrected thor (mail-search) after soliciting help from SMS for on-site
 repairs.

* Amended release policy to provide rationale and spent time explaining the new
 section to members@.  See http://www.apache.org/dev/release#why

* Work with Cordova on processing their historical releases to comport with
 policy.

15 Jan 2014

New Karma:
==========


Finances:
==========


Board Action Items:
===================


Short Term Priorities:
======================

* Look into mac build slaves.


* Converge on git.apache.org migration to eris. (Step 1 is merge git ->
 git-wip on tyr)
 (opinions?)

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call has taken
 place.

* Look into rsync backup failures to abi. Look into clearing out a lot of room
 on abi - currently 20GB left and 20GB+ a day gets backed up.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less
 painful this time around.
 (Support case closed, nothing useful came from it other than check the logs.)


Long Range Priorities:
======================

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache
 services.


General Activity:
=================

* Confluence: Finally got it to upgrade to 5.0.3. Database edits and
 conversions were needed to make the transition. After a few days bedding in
 it seems to be performing much better than the previous version.

* Translate.a.o: Upgraded to 2.5.1-RC1 (that is a release). Severe
 compatibility issues. Reprogrammed part of LDAP connection, to make it more
 stable (and work).

* [2nd Jan 2014] - Jenkins Master was migrated to a much needed new server.
 This also eases the pressure from Buildbot Master since the split of hosts.

* Migrated SVN repositories to newer, larger, and hopefully quicker array on
 Dec 31st. The repository upgrades will now be done in the coming weeks once
 we have seen stability in the Infra repository for at least 1 week. We will
 then likely re-purpose the SSD in the old array and add them to the new
 array for improved caching. Total downtime for the move was 1h15m as the
 prep work had been undertaken for at least 2 weeks before.

* RE: Symantec code signing service - There are a handful of internal tasks to
 complete before we can move on.

* Migration and reinstallation of continuum-ci.a.o (was vmbuild.a.o) has taken
 place.
 [Final checks are in progress before announcing its GA]

* [5th Jan 2014] - blogs.apache.org was upgraded by the roller project

* Faulty gmirror disk on eris, liaised with OSUOSL and swapped out disk.

18 Dec 2013

New Karma:
==========


Finances:
==========

* Funded Daniel Gruno's attendance at EU Cloudstack conference: cost TBD.


Board Action Items:
===================


Short Term Priorities:
======================

* Clear the lengthy backlog of outstanding tlp-related requests.

* Repurpose the new hermes gear for use as a (jenkins?) build master as that is
 more pressing.

* Investigate the migration tooling available for conversion from VMWare to
 Cloudstack [See attachment INFRA-1].

* Look into mac build slaves.

* Migrate eris svn repos to /x2, converting everything to 1.8.

* Converge on git.apache.org migration to eris.

* Investigate / negotiate external code-signing capability, currently in talks
 under NDA. INFRA-3991 is tracking the status, and a Webex call is being arranged.

* Look into rsync backup failures to abi.

* Complete nagios-to-circonus migration for monitoring.

* Continue to experiment with weekly team meetings via google hangout.

* Continue with the effort to reduce the overwhelming JIRA backlog. At the start
 of the reporting period we started with 134 open issues. We are now down to
 ~90 open issues.

* Jan Iversen has been pushing through the outstanding Tlp requests for virtual
 machines.  Several projects should by now have their VM.

* Explore the possibility of revamping the infra documents to have a more
 intuitive feel about them, improve readability.

* Confluence Upgrade. Needs an intermediate upgrade to 5.0.3 then to latest.
 Attempts to upgrade to 5.0.3 have been made and failed; opened a support case
 but we are continuing to try on a test instance.

Long Range Priorities:
======================

* Choose a suitable technology for continued buildout of our virtual
 hosting infra.  Right now we are on VMWare but it no longer is gratis
 software for the ASF.

* Continue gradually replacing gear we no longer have any hardware warranty
 support for.

* Formulate an effective process and surrounding policy documentation for
 fulfilling the DMCA safe harbor provisions as they relate to Apache services.


General Activity:
=================

* Both new tlp's this month, Ambari and Marmotta, were processed within 24
 hours of board approval.


Attachment INFRA-1: Cloudstack Conference Feedback [Daniel Gruno / humbedooh]:
==============================================================================

Attended CCC (Cloudstack Collaboration Conference) at Beurs van Berlage
in Amsterdam. Tried out Cloudstack locally with a /27 netblock, as well
as on testing platforms available at the conference. Apart from minute
errors in the UI (which I have reported), it seems to be working as
expected. Cloudstack supports LDAP integration, however this is not a
feature complete integration, and it is my view that an infra-made LDAP
implementation - with regards to _non-infra involvement_ - is preferred,
though we may elect to use it for the administration of the hosts.

Attended a talk about Apache LibCloud which seamlessly integrates with
Cloudstack for an easy programmable management of VMs via Python. This
removes the need for dealing with the rather cumbersome Cloudstack API,
and enables the possibility of creating an infra-managed site for
dealing with VMS in several ways. Should we ultimately decide on another
cloud solution, LibCloud integrates with just about every platform out
there, and so would not be affected by this to any large degree. I did
not get a chance to properly test LibCloud, so my findings in this regard
will have to be substantiated at a later date.

Cloudstack offers support for VMWare (WebSphere), Xen(cloud), and KVM,
so migrating is as much a question of "if" as of "when".
It supports using different hypervisors on different pods (a collection of
hosts), so working in tandem with a KVM or similar free hypervisor is an
option.

Migration options (assuming we go with KVM or similar):

A) Dual hypervisor mode (use both WS and KVM, only allot new VMs on KVM?)
B) Migrate WS boxes to KVM (Qemu-KVM supports this natively with VMWare
  version 6/7 disks)

if A, then we need to use separate pods for WS and KVM.
if B, then we pull boxes offline, one by one, move the images
     to the new host and KVM can handle the images.
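
For option B, one possible way to prepare an image is to convert the VMWare
disk to a native QEMU format before booting it under KVM. The sketch below
only builds the `qemu-img convert` command line. The file names and the
choice of qcow2 as the target format are assumptions for illustration, not
something decided in this report.

```python
import shlex


def build_convert_cmd(vmdk_path, out_path, out_format="qcow2"):
    """Return the argv for converting a VMWare disk image with qemu-img.

    The paths and the default target format are hypothetical examples.
    """
    return [
        "qemu-img", "convert",
        "-f", "vmdk",        # source format: VMWare disk image
        "-O", out_format,    # target format for the KVM host
        vmdk_path,
        out_path,
    ]


if __name__ == "__main__":
    cmd = build_convert_cmd("example-vm.vmdk", "example-vm.qcow2")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` on a host with
qemu-img installed would perform the conversion.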

Tentative proposal for future VM management:
Create one or more hosts with KVM in CS(or OS), assign a pod to the old WS
clients, use Apache LibCloud within an LDAP-authed site (TBD) where PMC
members can request, restart, get access to, and resize (to be acked by
infra) instances. Liaise with Tomaz Muraus (LibCloud), Chip
Childers (Cloudstack) etc. on the actual implementation details. This
would mean that infra's only role would be to ack the creation/resizing
of VMs and general oversight, rather than manual creation/modification
of each VM. I expect to have a mockup of what such a site could look like
ready for infra to review and discuss in mid-December, thus adding something
of value to the next board report about it.

Jake Farrell has offered to help with the CS setup, as he has experience
running this in large environments.

There have been some discussions of maybe using other management platforms
instead of CloudStack, but given that CloudStack and LibCloud are Apache
projects, it is my opinion that we are better served, support-wise, by
using software developed by the foundation, as well as the proverbial
"eating our own dog food".

20 Nov 2013

Discussed funding a pair of contractors to attend a Cloudstack
conference to gain additional skills- approved by VP Infra.  Only
one will actually attend.

Acquired a free license for Jira Help Desk - rollout forthcoming.

Installed wildcard SSL cert for *.openoffice.org.

In pursuit of outsourced code-signing capability for project
releases.  Negotiations have reached the NDA phase.

Migrated the bulk of our SQL infra to a centralized database server.

Discussed replenishing our Mac build infra.

Purchased a wildcard cert for *.incubator.apache.org.  3 years
at $475 per year.

Began holding informal weekly meetings via google hangouts.  Open
to all infra-team members.

Had a configuration regression regarding the PIG and DRILL Confluence
wikis, which allowed additional spam to reappear on those spaces.

Apachecon.eu DNS reacquired from our registrar.  Somehow it wasn’t
configured to autorenew so we lost that domain for a few days.

We are still considerably behind the curve in our Jira workload and
that is starting to inform some of the reporting at the board level.
Please be patient while we continue to ramp up with existing personnel
to support the org’s continued growth.  In response we have organized
a monthly jira walkthrough day dedicated entirely to outstanding
jira requests.  Raw jira stats show we have made significant progress
over the past month and we expect that trend to continue, with 116 opened
vs. 166 closed.

Aegis is reporting a bad disk and it needs to be replaced as the host
is seriously underperforming in its current state.

Dell has solicited a warranty renewal offering for arcas, our jira server.

We need to sort out licensing for our VMWare infra as we are currently in
a holding pattern for new VM’s until this gets resolved.

We’ve disabled the user ability to edit their profile page in confluence,
eliminating another common source of spam.

16 Oct 2013

An onslaught of Confluence spam required us to change
the default permission scheme to match what we've done
for the moin wiki.  We've also formally withdrawn all
support for the autoexport plugin.

Closed out the account of a deceased committer.

Received delivery at OSUOSL of the gear we ordered last
month.  Now in the process of bringing it online.

Upped the default per-file upload limit for dist/ to
200MB (from 100MB).

18 Sep 2013

Discussed recent scaling issues with roller and made
appropriate adjustments to the install.

Discussed picking up an SSL cert for openoffice.org.

Dealt with disk issues in the Y! build farm.

Dealt with high CPU consumption on the moin wiki.

Dealt with a vulnerability on the analysis VM.

Dealt with disk performance issues on erebus (VMware).

Discussed creating a dedicated database server again.

Dealt with a wide brute force password guessing attempt
against our LDAP database.  About 800 users were impacted,
none of them apparently had their passwords guessed.

Replaced a bad disk in hermes (mail).

Set up an organizational account with Apple to allow devs
to put their wares in the App Store.  Cordova will be the
first guinea pig.

Ordered some new Dell gear slated to replace existing
hosts.  Trying a new supplier largely for cost
savings.

Trying unsuccessfully to get our free VSphere license
updated.

A contributor inadvertently included customer data in two bugzilla
attachments, and politely requested that it be removed.  This request
was initially denied based on a careful reading of the current policy.
Subsequently, the author of that policy described his intent, the
contributor provided more information as to what they were requesting
to be removed and why, and as a result, the request was implemented.
The infrastructure team plans to revisit whether or not the policy
needs to be updated.

As to the potential policy change, there is a saying in legal circles
that "Hard cases make bad laws"[1] -- as well as a saying that "Bad
law makes hard cases"[2].  To some extent, both apply here.  The
overwhelming majority of requests for deletions are for people who
want something removed from a mailing list that is widely archived and
mirrored.  Often these requests come in after a considerable period of
time has elapsed.  For these reasons, it probably is best that the
documented policy continues to set the expectations that most requests
will be denied -- and further I believe that we should be open to
granting exceptions whenever possible.

Reflecting on (a) the low frequency with which exceptions will be
granted and (b) the amount of effort it took to resolve this, perhaps the
simplest thing that could possibly work would be an addition of a
statement like the following:

Exceptions are only granted by the VP of Infrastructure; requests for
removal of items that have already been widely mirrored outside of the
ASF's control are unlikely to receive serious consideration.

[1] http://en.wikipedia.org/wiki/Hard_cases_make_bad_law
[2] http://en.wikipedia.org/wiki/Hard_cases_make_bad_law#Bad_law_makes_hard_cases

21 Aug 2013

A report was expected, but not received

17 Jul 2013

Discussed the logistics of bringing the ACEU 2012 videos online.

Dealt with the fallout surrounding the recent javadoc vulnerability.

Discussed upgrading our VSphere license with VMWare.

We continue to iron out kinks with our new circonus monitoring service.

Added a brief mission statement here:
http://www.apache.org/dev/infra-contact#mission

Upgraded svn on eris (svn.us) and harmonia (svn.eu) to current versions.

Discussed making shell access from people.apache.org an opt-in service.

19 Jun 2013

Purchased 9 disks from Silicon Mechanics to fill out the eris array -
cost ~$1400.

Promoted Jan Iversen to the Infrastructure Team.

Had oceanus (new machine) racked in FUB.

Mark Thomas kicked off a new round of FreeBSD upgrades.

Disabled CGI support for user home directories on people.apache.org.

Purchased a wildcard cert for *.openoffice.org at $595/year from digicert.

Daniel Gruno was given root@ karma and will need to be added to the
committee as a result.

Set up a new VM with bytemark for circonus-based monitoring.

Work continues around the Flex Jira import problems.

Acquired several new domains for management by us instead of external
parties.

15 May 2013

Completed our budget deliberations including funding for a new
part-time position.

Purchased 3 new HP switches to replace our aging Dell switches.
Cost ~ $4700.

Continued discussion of code-signing certificates for our
projects.

Dealt with some failing/overloaded build machines in our Y!
farm.

Jan Iversen continued to work on our nagios -> circonus
service monitoring migration.

2 disks have failed in loki (tinderbox), we've replaced one from
inventory but will need to order more to complete the replacement.

Experienced some security / porn issues with the moin wiki and
have upgraded to the latest version to assist with controlling
the spam.

We will be disabling password-based ssh access to people.apache.org
in the near future, once the supporting scripts have been tested.

Rainer Jung was granted root karma and needs to be added to the
formal committee roster.

17 Apr 2013

A report was expected, but not received

20 Mar 2013

About to spend ~$1500 for additional drive capacity for eris (svn.us).

Updated our inventory with Traci to better align with their power-cycling
service.

Set up wilderness.a.o lua playground for Daniel Gruno.

Granted danielsh and gmcdonald IRC cloak-granting karma.

Upgraded bugzilla to 4.2.5 in response to a vulnerability announcement.

Jira was upgraded to the latest stable (5.2.8).

Daniel Gruno set up a direct SMS service for root-users to take advantage of.

Again discussed timing issues surrounding the dissemination of authoritative
declarations about newly minted TLPs.

Set up paste.apache.org - a pasting service for committers to use.

Still dealing with the fallout of our failed rack-1 switch.  Now pursuing
indirect support with Dell.

Received/deployed the new box for an additional vmware hosting service.

Upgraded httpd on eos and aurora(www/mod_mbox service) to 2.4.4.

Restored archiving for the EA mailbox.

Jan Iversen upgraded mediawiki for Apache OpenOffice due to an announced
vulnerability issue.

Working with concom to set up a USB disk with video data on it in one of
our OSUOSL racks.

Pruned the apmail group list down to the relevant current active folks.

20 Feb 2013

Placed a $15K order for a new vmware server with Dell which is now on
backorder through February.

Still dealing with the fallout of losing one of our public switch
interfaces.

Uli is working on OOB access for us at FUB.

Specced some additional drive capacity for eris.

Discussed setting up apaste.apache.org as a pasting service for Apache
based on Daniel Gruno's apaste.info site.

Started the process of reining in the abusive maven traffic to
svn.apache.org.

Started the process of dealing with the missing Flex attachments for a
Jira import.

Shut off the people -> www rsync jobs for our websites. All project sites
now MUST be on either svnpubsub or the CMS to continue to be maintained.

Enabled redirects for our svn.apache.org services for graduated podling
trees.

Upgraded the software on adam (OSX) for $40 thanks to Sander Temme.

Was contacted by Traci to update our inventory with them.

16 Jan 2013

Rainer Jung and Jan Iversen have done a stellar job of
coordinating and collaborating on a new wiki host for
openoffice, among other activities from Rainer in particular.

Discussed with the secretary the best way to populate
a new tlp's LDAP groups from either the unapproved minutes
or the agenda file.  After a bit of back and forth we
settled on the agenda file for the time being, however
this effort remains a convenience for the incoming chair,
not a means of setting up the groups in a permanent and
official way.  The chair remains responsible for vetting
the groups post-setup.

Had a robust yet not entirely satisfying discussion with
the membership about relaxing ACL's for our subversion
service, culminating in the following url

 http://www.apache.org/dev/open-access-svn

It is expected to take an extended period of time before
such changes can be effectively implemented as infra policy,
but the goal of simply raising awareness has been met.

We are in the final phases of withdrawing support for rsync-backed
websites, which we expect to complete before the end of the month
rolls around.  At this time there are still several outstanding
projects who have yet to file a jira ticket with us to migrate
to either svnpubsub or the CMS, and their ability to continue
to service their live site with updates and new content will
be impacted.

Daniel Gruno and others have been working on a gitpubsub
service over on github and have rolled out a demo version
to our live writable git repos.  We expect even more coordination
between the svnpubsub service maintained by the subversion
crew and the gitpubsub service Daniel and others have worked
on.

19 Dec 2012

Lost our public VLAN in our rack1 switch for undetermined reasons,
probably due to a misconfiguration on the OSUOSL side.  Will continue
followup with OSUOSL for eventual resolution.

Enabled core dumps on one of our mail-archives servers to better diagnose
the nature of the ongoing segfaults.

Specced a new VMWare host to offer additional VM's to our projects, then
haggled with each other over the config.  Now appears we're going to
repurpose chaos (36 disk enclosure) to serve up a Fiber Channel interface
to the new host.

The tlpreq scripting is now in place and ready for new graduating projects
to use.  We will pass along the details to the Incubator for podlings due
to graduate in December.

Decided we're comfortable with the OpenOffice project keeping at most 2
releases on the Apache mirror system (/dist/) at any one time.  Henk
Penning has communicated this back to the AOO PMC.

Came across a bizarre privacy information leak with Jenkins and LDAP.
We've patched our installation to mitigate the issue.

Discussed our near-term plans for git hosting on various lists.  One of
the exchanges was needlessly heated, and we have tried to rectify the
situation with better documentation and a bit less BOFH tactics.

21 Nov 2012 [Sam Ruby]

Restructured the creation of new mailing list infrastructure:
new "foo" lists will be named following this convention:

 foo@$podlingname.incubator.apache.org,

instead of the now antiquated "$podlingname-foo@incubator.apache.org".

Similarly restructured website assets for new podlings to use

 http://$podlingname.incubator.apache.org/

instead of the prior "http://incubator.apache.org/$podlingname/".

These changes will help make migration to TLP status easier for
both the podling and the Infrastructure Team.
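
The naming change can be illustrated with a small sketch. The helper names
and the regular expression below are hypothetical and are not part of any
ASF tooling. The sketch also assumes that podling names contain no hyphens
and that a graduated project simply drops the ".incubator" label from its
host name.

```python
import re

# Old convention: $podlingname-foo@incubator.apache.org
# Assumption: podling names have no hyphens; list names may.
OLD_STYLE = re.compile(
    r"^(?P<podling>[a-z0-9]+)-(?P<list>[a-z0-9-]+)@incubator\.apache\.org$"
)


def new_list_address(old_address):
    """Map an old-style podling list address to the new convention."""
    match = OLD_STYLE.match(old_address.strip().lower())
    if match is None:
        raise ValueError(f"not an old-style podling list: {old_address!r}")
    return f"{match['list']}@{match['podling']}.incubator.apache.org"


def new_site_url(podling):
    """Return the new-style website URL for a podling."""
    return f"http://{podling}.incubator.apache.org/"


def tlp_list_address(incubator_address):
    """Address after graduation, assuming only '.incubator' is dropped."""
    return incubator_address.replace(".incubator.apache.org", ".apache.org")


if __name__ == "__main__":
    addr = new_list_address("example-dev@incubator.apache.org")
    print(addr)
    print(tlp_list_address(addr))
    print(new_site_url("example"))
```

Under the new convention the list name and the podling name never have to be
split apart again, so a move to TLP status touches only the host part of the
address.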

Coordinated with Sally with respect to the Calxeda/Dell ARM donation.

Worked with OSUOSL to mitigate the downtime of a series of
scheduled network outages affecting our .us services.

Discussed the pros and cons of using Round-Robin DNS for websites.
No action taken moving us away from RR DNS.

Rainer Jung upgraded our webserver install on eos (www.us) to the
latest and greatest version of 2.4.x.

Mark Thomas upgraded all 3 bugzilla instances in response to a security
vulnerability report.

Some generic stats detailing the org's recent growth:

 New committer intake: ~300 / year like clockwork over the past decade.

 New TLP graduations:
                       2010: ~20
                       2011: ~10
                       2012: ~30

 New INFRA issues:  http://s.apache.org/INFRA-Creation-2005-2012

 Mailing List / Subversion activity: http://www.apache.org/dev/stats/ [*]

 Average Subversion traffic (hits): consistently ~ 3.2M / day for the past
 few years.

 Inbound Mail traffic: 250-300K connections per day for the past few years.

 Average Website traffic (hits):
                        Nov 2010: 10M / day
                        Nov 2011: 11M / day
                        Nov 2012: 21M / day [**]

 Rough Download Page traffic (hits):
                        Nov 2010:  48K / day
                        Nov 2011:  46K / day
                        Nov 2012: 145K / day [***]

 46 Virtual Machines (18 new within the past year) and 24 additional ARM
 servers due to the Calxeda/Dell donation.

[*] - clicking on the mailing list graph shows incubator + hadoop + lucene
   is now responsible for 40 % of the org's total mailing list traffic.

[**] - 10M / day due to www.openoffice.org.

[***] - 100K / day due to openoffice (note openoffice users typically upgrade
using the openoffice software itself rather than by visiting the download webpage)
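
The footnotes suggest that most of the 2012 jump comes from OpenOffice. A
short sketch of that arithmetic follows, using only the figures quoted above.
The variable and function names are illustrative, and the quoted figures are
themselves rough.

```python
# Hits per day, as quoted in the report.
website_hits = {2010: 10_000_000, 2011: 11_000_000, 2012: 21_000_000}
download_hits = {2010: 48_000, 2011: 46_000, 2012: 145_000}

# Share of the Nov 2012 figures attributed to OpenOffice in the footnotes.
openoffice_website_2012 = 10_000_000
openoffice_download_2012 = 100_000


def growth(old, new):
    """Fractional change from old to new (0.5 means +50%)."""
    return (new - old) / old


# Nov 2012 traffic with the OpenOffice share removed.
website_ex_aoo = website_hits[2012] - openoffice_website_2012
download_ex_aoo = download_hits[2012] - openoffice_download_2012

if __name__ == "__main__":
    print("website 2011->2012:", growth(website_hits[2011], website_hits[2012]))
    print("website excl. AOO: ", growth(website_hits[2011], website_ex_aoo))
    print("download 2011->2012:", growth(download_hits[2011], download_hits[2012]))
    print("download excl. AOO: ", growth(download_hits[2011], download_ex_aoo))
```

On these numbers, website traffic excluding www.openoffice.org is flat
against Nov 2011 (11M / day), and download page traffic excluding openoffice
is about 45K / day, slightly below the 46K / day of the year before.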

AI: Sam follow up with regard to git post-commit hook support by infra

17 Oct 2012 [Sam Ruby]

Expressed some concerns about the ongoing volunteer
support and documentation for nexus (repository.apache.org).

Spoke with a few PMCs about cleaning up temporary artifacts
in their website rsync ops.

Discussed infra meetup plans to coincide with Apachecon EU.

Due to the fact that the CIA.vc service was shut down, we
are considering other options to provide the same functionality
to our projects.

Dealt with some issues surrounding the generation of
projects.apache.org pages.

The contract for colocation service with FUB has been signed
by FUB and awaits our counter sign.

Started discussing various approaches to simplify the podling
graduation process from an infra standpoint.

19 Sep 2012 [Sam Ruby / Jim]

Daniel Gruno added to the Infrastructure Team.

Bought a pair of 1TB SATA drives, one of which was
used to replace a bad disk in hermes (mail).

Calxeda donated access to a 24-node ARM-based hosting
service to be deployed as part of our build farm offerings.

Removed the custom jira patch licensing plugin largely
because no one wanted to continue to maintain it.

Migrated jira service to its own hardware (a spare r410)
for stability reasons.

Picked up 7 more SATA drives for inventory and replacement.

VP Infra pointed out various rumblings rising to the board
level about contractor communications, and had various ideas
about how to address that.

Started work on a Circonus-based service to replace our aging
nagios installation.

AI Jim follow up with infra regarding git status

15 Aug 2012 [Sam Ruby]

No report was submitted.

Will report next month.

25 Jul 2012 [Sam Ruby]

Updated the mailing list creation process- see
https://infra.apache.org/officers/mlreq

Determined that eve.apache.org, one of our Apple Xserves,
is no longer capable of productive service due to various
hardware faults.

Discussed acquiring additional "cloud" services for our
projects to use, ostensibly thru some unspecified bidding
process.  Nothing much came of it.

Approached by Dell regarding warranty renewal for selene
and phoebe (Geronimo TCK build farm).  We declined.

More discussion, much of it less than constructive, about
providing a digital signature service to Apache project
releases.

Worked out a deal with Calxeda to provide a few ARM-based
build servers for our projects to use (at no cost to us
other than admin time).

Granted Philip Martin of the Subversion project access to
eris (US svn server), mainly for his offer to help with some
svn server debugging.

Working with our main DNS provider no-ip.com to stabilize our
account services to better deal with the dozens of extra domains
the AOO project needs us to host.  We've been getting gratis
service to this point, but we are willing to pay for better
responsiveness and additional features only available with a
paid support plan.

Discussed plans for an infra meetup to roughly coincide with
ACEU.

Work on the backup system migration from bia (in Los Angeles)
to abi (in Fort Lauderdale) is nearing completion.

Some progress was made on getting the number of outstanding
jira tickets down to normal levels.  We've reclassified tickets
based on whether they are "waiting for user" input or "waiting for
infra", which has helped, but the bulk of the open tickets still
are "waiting for infra".  We expect to continue to make progress
on this over the coming days and weeks, and will continue to report
on it until we are satisfied things have returned to an acceptable
state.

Daniel Gruno put together a nice comments service at comments.apache.org
for project websites to take advantage of.

20 Jun 2012 [Sam Ruby]

blogs.apache.org has returned to normal service post-upgrade; thanks to Dave
Johnson for the particulars.

Migrated our jira instance to our shiny new phanes/chaos VM cluster and it
seems to be performing rather well now.  Thanks go to Dan Kulp for debugging
the svn plugin for us, as well as culling the jira-administrators group to sane
levels (projects will need to make better use of roles).  Upcoming work to
include flushing the backlog of pending jira imports, which we've now started
with Flex.

Discussed options for "Cloud support" for a certain GSOC project. Conclusion
was that we don't currently have a suitable arrangement worked out with a cloud
provider to offer the ASF the enterprise setup that we'd require.

Did the password rotation dance again- this time there were no malicious
activities surrounding the action.  See http://s.apache.org/zZ for details.
The situation has since returned to normal now that we've reenabled committer
read access to the log archives on people.apache.org.

Discussed java hosting futures with members of the FreeBSD community in
light of the fact that Atlassian does not consider FreeBSD a supported
platform, which at least partially motivated our move of jira from a
FreeBSD jail to an Ubuntu based VM.

imacat added to the Infrastructure Team to help support the ooo wiki and
forums platforms.

Work on new harmonia is currently underway- colocation provided by Freie
Universität Berlin (FUB).  Uli Stärk is our lead on this.

Decided not to pursue a meetup at the Surge conference this year, preferring
to get together at one of the upcoming Apachecon conferences.

Discussed support status for git and documented our plans for bringing
it to a fully supported service over the next 3-6 months, culminating
in the following awkwardly tautologous statement by VP Infra:

 "The infrastructure team has four full time contractors and a variable
 number of volunteers and is committed to supporting both git and
 subversion."

Experienced extended downtime for reviews.apache.org after an OS upgrade
busted our install.  Dan Dumont from Apache Shindig has been assisting
us in recovery- we expect the service to return to active status by
the time of the board meeting.

We've fallen a bit behind in our caretaking of jira issues mainly due
to a number of new graduations from the Incubator.  We'd like to return
the number of outstanding issues to "normal levels" within the next reporting
period, which seems a reasonable goal given the expected (small) number of new
graduations happening this month.

We upgraded people.apache.org (aka minotaur) in light of the recent security
reports from FreeBSD concerning a local root exploit.

Considerable work has gone into scripting various workflows around common
requests like mailing lists, git repos, and CMS sites.  We've subsequently
created infra.apache.org to house these efforts once they've fully gelled
from their development versions at whimsy.apache.org.

Daniel Shahaf is stepping back to part-time for three months starting in July.

16 May 2012 [Sam Ruby]

Added Mohammad Nour El-Din to the Infrastructure Team.

Upgraded Jira to version 5- kinks still being ironed out.

Had OSUOSL install our recent purchases from Silicon Mechanics
and Intervision.

Coordinated with the OpenOffice PPMC and Sourceforge regarding
the distribution of AOO's 3.4.0 release.  Most of the traffic
was handled by Sourceforge's CDN instead of our mirror system,
at a rate of over 20 TB of download traffic a day.  Download
stats will be published by the AOO PPMC in the near future.

Replaced isis's bad disks and brought up 2 additional build
hosts at our Y! datacenter.

Discussed another F2F meeting during the Surge 2012 conference.
Not a lot of expressed interest so far.

Experiencing uptime issues with our monitoring.apache.org Bytemark
VM ever since they migrated the VM to different hardware.

Discussing git hosting options again, and again, and again...

18 Apr 2012 [Sam Ruby]

Coordinated the installation of aurora (www.eu) in SARA
with Bart van der Schans.  We are now out of free space
in that datacenter.

Our new backup server abi is in service at Traci.
We've arranged for Sam to have access to the joes-local
safe deposit box in an emergency.

Worked out an informal deal with SourceForge to assist
with the delivery of OpenOffice releases.  Whether or
not this continues to be used beyond the first release
is up to the AOO PPMC.

Henk Penning is putting the final touches on providing
optional Apache-mirror support for OpenOffice releases.

Bought an array from Silicon Mechanics for about $12K.
The host to attach it to will be purchased through
Intervision in the very near future.

Updated the AOO bugzilla logins to reflect the fact that
we host the openoffice.org domain but no longer allow mail
for it.

Considering alternatives to directly importing Flex's jira
data into the main jira instance due to repeated failed attempts.
Atlassian refuses to assist us until we can demonstrate the
identical problem on a supported platform.

Submitted our budget for review and approval.

Harmonia's (svn.eu) disk subsystem finally stopped performing
well enough to continue using it as our European svn slave.
Specced a replacement host with Uli Stärk's help.

Fleshed out a deal with Freie Universität Berlin for colocation
services, targeting new harmonia as the first host to deploy there.

Set a soft limit of a combined 1GB worth of release artifacts
dropped onto the mirror system; anything exceeding that figure needs
to coordinate with infrastructure in advance.
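
As an illustration only, a project could check a release directory against
the soft limit with something like the sketch below.  The function names are
hypothetical, and treating 1GB as 2**30 bytes is an assumption; the report
does not say how the figure is measured or whether any such check exists.

```python
import os

# Assumption: "1GB" is taken to mean 2**30 bytes.
SOFT_LIMIT_BYTES = 1024 ** 3


def release_size(path):
    """Return the combined size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            # Skip symlinks so linked artifacts are not counted twice.
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def needs_coordination(path, limit=SOFT_LIMIT_BYTES):
    """True if the artifacts under path exceed the soft limit."""
    return release_size(path) > limit
```

A release manager might call needs_coordination() on the staging directory
before publishing and contact infrastructure if it returns True.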

21 Mar 2012 [Sam Ruby]

All outstanding bills have been paid.

Spoke with the myfaces PMC about their public-facing
maven repo on their zone. The zone has been taken down.

Discussed replacement purchase for harmonia (svn.eu)
and an additional VMware server for linux vm's.

Upgraded the majority of our FreeBSD servers to 9.0.
Will complete the remaining ones in the near future.
Next time round we will likely enable dtrace throughout.

Discussed infra's authority to pull improperly-signed
releases from the mirrors.  VP Infra concurs we can/should
do this if appropriate justification is available.

Mark Thomas upgraded our Bugzilla installs to 4.0.4.

Dealt with our monitoring host being redeployed to a new
VM server.

Discussed hosting a yum repository for project releases.
No decision was made at this time, pending further input
from volunteers.

Discussed deploying an Australian svn mirror to Gav's
local server.  No decision was made at this point.

Greg Stein has taken svnpubsub over to Subversion's trunk
and has done a lot of work on improving the svnwcsub service.
We look forward to seeing svnpubsub distributed in a subversion
release!

Established the precedent of not resetting accounts for
returning committers that are no longer a part of any active
project.

Experiencing chronic problems with isis, one of our build hosts
in Y!'s datacenter.  Y! has offered us a pair of new servers
to supplant it.

Began discussion of a budget for FY2012-2013.

Posted a few CMS-related blog entries to http://blogs.apache.org/infra
describing recent activity.

15 Feb 2012 [Sam Ruby]

Still attempting to pursue a github FI instance for ASF use.

Intervision LOC app has still not been filled out and sent off.

Purchased an HP switch for SARA for 456 EU.

Renewed warranties on Dell and Sun gear for 1Y with Intervision
and Technologent, respectively.


Prepped aurora and the HP switch for installation in SARA.

Renewed apache.org DNS for another 9 years.

Silicon Mechanics reminded us of our outstanding $2037 credit
with them.

We've sent out a notice to all PMCs about the plan to migrate all sites
to svnpubsub or the CMS by the end of this year.

24 Jan 2012 [Sam Ruby / Jim]

Still attempting to pursue a github FI instance for ASF use.

Intervision LOC has still not been filled out and sent off.

We've determined there is adequate space (9U free) for us
to install our new gear in SARA.  However we are in need
of an additional switch, which is tasked to Uli Stärk for
purchase.

Awaiting a final decision from the OpenOffice podling regarding
hosting of extensions and templates.  They are considering
either hosting those services locally at the ASF or with
SourceForge.

Joe spent a few hours training Melissa (EA) on svn use.
A google calendar is now available for everyone's use
thanks to Melissa.

Secured and rolled out a wildcard cert from Thawte.

Dealt with a minor security issue in the Nexus installation,
reported by Sebastian Bazley.

Partitioned our SSL termination for our linux hosts to separate
VM's for additional security.

Began deploying puppet for eventual management of our linux hosts.

Dealt with a benign security incident on modules.apache.org.

Further enhanced CMS performance with the introduction of parallel
builds.

21 Dec 2011 [Sam Ruby / Jim]

Floated the concept of providing Sam Ruby a place to supply
a grab bag of useful CGI scripts he has developed over
the years.  Originally hosted on the id.apache.org server
as rubys.apache.org but is in the process of being redeployed
to a separate VM named whimsy.apache.org.

Rejiggered switch 2 (our "private" switch) with OSUOSL
help to better align it with our goals of a single switch
per cabinet.

Set up tethys.apache.org for dedicated service to the OpenOffice
extensions and templates services.  As these services include
providing downloads for non-open-source licensed products,
explicit permission by VP Infra for this purpose was granted.
VP Infra delegated the final decision on licensing to the
Incubator PMC once OpenOffice proposes to graduate.

Set up translate.apache.org, which is a pootle-backed service
for projects to use in their language translation efforts.

Initiated "Git Friday" to focus on git-related support issues
across the whole team on a weekly basis.

Discussed installation of new aurora into SARA with Bart van
der Schans and Wim Biemolt of SARA.  Progress is gated on
Wim getting back to us on the status of our to-be-determined
additional rack space.

Tony Stevenson put in his notice to return to paid contractor
status starting Jan 1.

Opened the floor to input on contract terms for sysadmins, in
particular the immediate part-time post(s).  General feedback
on the idea of including explicit metrics in the terms was
negative.

Wrote a testimonial for the FreeBSD Foundation regarding our
FreeBSD deployments.

Gave the CMS a big performance boost by replacing the rsync
usage with zfs clones.

Put out an RFP for participation in our ongoing alpha test
of git hosting.  7 projects responded; all 7 were accepted:
 wicket, callback, cassandra, s4, deltaspike,
 trafficserver, deltacloud.
Participating projects are required to provide an infra
volunteer to work on the git hosting code.

Got Thawte's nod for a wildcard cert via Bill Rowe's contacts
(waiting on Sam Ruby to chase up Brian the sales rep).

16 Nov 2011 [Noirin Plunkett / Jim]

Accepted a "loan" from IBM for a PPC64 server to be
incorporated into our build farm.

Welcomed the new VP Infra Sam Ruby to the leadership
post for Infra.

Changed our plans from simply replacing the bad disk
in harmonia to replacing the entire host asap.

Completed the migration of the openoffice.org domain
to ASF control.

Discussed mirroring options for the large set of artifacts
produced in OpenOffice releases.

Registered the apachecon.eu domain with our dotster account.

Held a minor infra meetup at Apachecon led by Philip Gollucci
to discuss transition issues for the VP role.

Discussed a few proposed mail templates to send out regarding
our svnpubsub migration plans for dist files and websites.

Pushed to resolve a few outstanding DNS migration issues for
the subversion, spamassassin, and libcloud domains.

26 Oct 2011 [Philip Gollucci / Jim]

Board Action Items:
===================
Intervision LOC needs to be signed, filled out, and sent
approved/dell-2011-05.pdf needs to be moved to paid by treasurer@
approved/Noirin-Infra-flights.txt needs dealing with

No other items are expected to be changed in Staff SA contracts.
Need signature by an officer who is aware of the new
terms.

General Activity:
=================
The harmonia (svn.eu) bad root disk situation remains
unresolved.   We expect to purchase a replacement disk
with Uli's help and have Bart van der Schans install it
shortly.

Bart has received and tested the replacement server
for aurora (www.eu).  We need to schedule a date for
decommissioning and eventual reinstall of aurora soon.

Held an infra meetup to coincide with the Surge conference
in Baltimore, MD.  Discussed plans for the remainder of the
fiscal year. Tabled the staff review b/c the president could
not attend.  We'll reschedule.
See notes here -- http://s.apache.org/surge2011infra

Pursuant to explicit authorization from the board, started
the git hosting experiment with CouchDB as the initial guinea
pig.

Renewed the spamassassin.org domain for 1 year.

Updated people.apache.org's operating system in light of the
published local-root vulnerability in FreeBSD.  Other FreeBSD
hosts are scheduled to be upgraded to FreeBSD 9.0 on release.

Began work rationalizing the state of our switches at OSUOSL.
The current situation is a mess; we're moving to a
one-switch-per-rack configuration with no cross-cabling between racks
other than through the patch panel.

Re-initiated the transfer of the openoffice.org domain to the ASF's
dotster account.

Renegotiated the financial terms of Gavin's and Joe's contracts.

Began talks for another infra meetup sometime during Apachecon NA 2011.

21 Sep 2011 [Philip Gollucci / Jim]

New Karma:
==========

Finances:
==========

Payments to staff were about 1 week late this month.

Board Action Items:
===================
Intervision LOC needs to be signed, filled out, and sent

approved/dell-2011-05.pdf needs to be moved to paid by treasurer@

General Activity:
=================
Response time on new account requests remains under
2-3 days.

The harmonia (svn.eu) bad disk situation remains unresolved
at this time.

Ordered a replacement host for aurora (www.eu) from a Dell
reseller in Germany. Cost = 5259.80 EU.  It is to be shipped
to Hippo for eventual installation by Bart van der Schans.

Daniel Shahaf improved our automated banning of abusive IP
addresses with respect to svn traffic.

Failed to successfully incorporate Terry Ellison of the
openoffice.org (ooo) project into the infrastructure community.
Terry was working on migrating the existing wiki and forum
services for that project to ASF gear, but gave up after
being frustrated with his interactions with the ooo community
at the ASF and the infrastructure team in particular.  His
volunteer efforts will be missed.

Mark Thomas successfully migrated the ooo bugzilla instance
to ASF gear.

Mark Thomas also improved our svn traffic banning schemes.

Upgraded our relative state of paranoia following breakins to
kernel.org and linux.com.

Lost a disk in hermes' (mail) zfs array which was subsequently
replaced with an existing spare in the rack.  We need to look into
purchasing another spare of the same specifications for future
disk failures, as there are none left for us to use in the rack.

17 Aug 2011 [Philip Gollucci / Jim]

Set up and racked all the gear for the backup solution at TRACI.
Philip flew down to assist, and we set up a safe deposit box to
store tapes offline.

Harmonia (svn.eu) lost a root disk and reported errors with its zfs array.
The errors were subsequently cleared but we need to look into a
replacement root disk for the one we lost.

After more complaints about delays in the account creation process,
Sam Ruby created a script to automate the input processing for new
requests.  Together with more root people participating in the process,
this has significantly cut response times from several days to just
a few.

Upayavira continues to work on the selfserve infrastructure, which will
someday completely replace the existing account creation procedure.

Uli Stärk is finalizing the specs for the EU replacement host for
aurora (www.eu).

Setup for the Openoffice MediaWiki install as well as the Openoffice
forum site has made significant progress.  Terry Ellison has led the
effort for both.

Andrew Bayer approached us with an offer to provide either a cash
donation or hardware donation for additional build slaves.  Serge
explained the situation with targeted donations and our past experiences
with hardware donations at the ASF.

Daniel Shahaf upgraded our svn installation to 1.6.17 on eris, thor,
and harmonia in order to prevent further loss of commit emails.

Cleaned up the root@ alias, adding the President Jim Jagielski to it.

Doubled our available RAM on sigyn (jails) in the hopes of improving
stability of the host.

Bumped the secretary's LDAP karma to a level at least on par with
a PMC Chair.

Niklas Gustavsson was granted infra karma for his work on our Jenkins
build infrastructure.

20 Jul 2011 [Philip Gollucci / Jim]

LDAP + TLS + LPK work is underway.

Justin Erenkrantz stepped down from root@.

Ordered 1y Warranty extension for selene and phoebe (Geronimo TCK builds).

Organizing an infra meetup to coincide with the Surge Conference in
Maryland next month (http://omniti.com/surge/2011/).

Got listed in SORBS again.  Subsequently filled out the delistment forms
so all is well again.

Brought modules.apache.org in-house to better deal with the spate of XSS
vulns.

Started coordinating with existing openoffice.org sysadmins to plan for
eventual service migration.

Part of the backup system order has arrived in Florida.  The remainder is
due to arrive by the end of the week.

Arranged for travel from VA to FLL for Philip to help out with the racking
and safe deposit box for the backup system.

Work on upgrading id.apache.org to handle acct creation is progressing.

Uli Stärk specced a replacement host for aurora (eu websites).

AI: Sam to follow up on Philip's missing credit card

15 Jun 2011 [Philip Gollucci / Jim]

Upgraded MoinMoin wiki to 1.8.8.

Mark Thomas upgraded Jira to 4.3.4.

Niklas upgraded Jenkins to 1.413 and moved it to a more stable URL
(removing hudson from the URL).

Took another whack at cleaning up /dist dirs.  Current status is appended
to the end of the report.

VP visited OSUOSL to survey the racks.

Looking to purchase additional RAM for sigyn (which keeps panicking
because of lack of RAM).

More XSS vulns reported against modules.apache.org, which we don't
physically host.

Ran into a snag while placing the order for the replacement backup
system.  At this point we need the treasurer's sign-off on a Bank &
Trades agreement with Intervision who will subsequently provide us
with Net-30 terms for the order.

OSUOSL has notified us they are running low on cooling capacity, which
may affect our ability to host new machines at their datacenter.

OSUOSL wants us to purchase a switch to rationalize their cabling
system for our 3rd rack.  We plan to accommodate them once the backup
system has been purchased.

We are purchasing a warranty extension contract from Dell for selene
and phoebe, our Geronimo TCK build hosts in Traci.

-----

Status of PMCs/podlings that had not completed /dist clean-up by third
reminder

Ignored all three e-mails to the PMC:
 TLPs:
   buildr, cassandra, click, couchdb, logging,
   spamassassin, synapse, tcl, tiles

 Incubator (Graduated):
   buildr, river

 Incubator:
   deltacloud, manifoldcf, olio, vcl, whirr


In-progress:
 chemistry - partially complete, dotcmis 0.1 needs to be removed
 empire-db - partially complete, 2.0.7 needs to be removed
 ws - partially complete, axis-c 1.5.0 needs to be removed
 santuario - argued, did nothing
 openwebbeans - argued, did nothing


Fixed and confirmed on third reminder:
 abdera, bval, cocoon, esme, hbase, hive, libcloud, nutch, oodt, perl,
 portals, qpid, roller, shiro, thrift, uima, wink, xmlbeans,
 xmlgraphics

Should not have been on third reminder list:
 jackrabbit

19 May 2011 [Philip Gollucci / Jim]

Migrated people.apache.org (aka minotaur) to HP gear.

Discussed changing our Dell rep to a reseller better suited
for our business.

Received Technologent invoice via OSUOSL.

Started the initial steps of transferring ownership of
jini.net to the ASF.

Tony Stevenson was brought on board as our third contracted
(part-time) sysadmin.

Set up ACLs to allow PMC members to browse the archives of
their own private lists.

Set up infrastructure for storing PGP fingerprints in LDAP.

Discussed the addition of a benchmark-running host for projects
to use, particularly lucene.

Continued pursuing the offenders on the XSS list reported to us
by security@ last month.

Scheduled an infra meetup to happen at the Surge conference
at the end of September.

Investigating some uptime problems with sigyn (tlp zones).

Received a pair of D53J JBODs from Silicon Mechanics, with
a partial refund (directed to treasurer@) for delays and lower
performance drives.

Addressing management of dist/ trees as requested by Greg Stein

It was observed that some account creation requests were not done for over a month. Prior experience was that accounts were processed weekly.

Should not happen again.

20 Apr 2011 [Philip Gollucci / Jim]

Cancelled our incorrect order for warranty support with Dell.
The order was subsequently replaced with a lower-cost version ($1000),
covering the same equipment for a 1y term.

Sent a purchase order to Technologent for 1y warranty support on thor
(confluence) and gaea (zones).

Purchased and received a Dell r210 for $1700 for managing our VMWare
VSphere installation.

Corrected our contact information with Dell for the umpteenth time.

Had OSUOSL deal with the mis-shipment from HP.  Correct replacement parts
are expected to arrive soon.

Provided a little mod_rewrite magic to simplify our main webserver
config.

Upgraded Confluence in advance of a public security notice.

Purchased a pair of Intel nics: 1 to go into the HP gear and 1 for
emergencies.

Tweaked the network config on eris (svn) and eos (websites) to alleviate
chronic downtime problems.

Made an aborted attempt to upgrade minotaur (people) to new gear.  Will
be repeating that process shortly, this time using our HP donation as
the target host.

Mohammad Nour was granted karma to help Paul Davis sort out git hosting.

Received a comprehensive XSS vuln report for several services we offer.
Still working through the list.

Offered Ryan Pan infra-interest karma for his penchant for running
vmstat on minotaur (people).

Renewed our service warranty with Dell for 2y covering hermes (mail) and
eris (svn). Price: $1200.

Sorted out the confusion regarding the iLO license for the HP gear.

pear.apache.org was requested by a couple of PHP projects, it is now
live ready for projects to put PEAR releases on.

Citing a lack of time, Philip, the current VP, asked for someone
to be appointed by Jim, president.  It seems unanimously agreed, at
this time, this is not in the best interest of the foundation or the
infrastructure team.  Instead, some of this role's responsibilities,
which are too much for a volunteer over the long run, are going to be
doled out to a new part time paid contractor and the current full-time
staff.  It's hoped this will reduce the workload of the role to about
5hrs/week. This is reasonable for an active volunteer.  We will
re-evaluate this after the new position is up and running.  Philip
is able to continue with the VP position for now with this reduced work
load.

16 Mar 2011 [Philip Gollucci / Jim]

Upgraded ALL of our FreeBSD hosts to 8.2-RELEASE except for minotaur,
which is slated for replacement later this week.

Paul Davis was granted infra karma for his work on git hosting.

Reined in the use of Jira accounts associated with mailing lists,
predominantly for security reasons.

Submitted a budget to the budget committee for 2011.

The long-awaited gear from HP has arrived at OSUOSL.  We are in the
process of having it racked and brought online in our new 3rd rack.
(Unfortunately some of the drives are incompatible with the hosts,
so we are awaiting additional gear from HP to resolve this).

Migrated user .forward files into ldap.  Ldap now is authoritative
for forwarding addresses.

Rationalized much of our vhost config for the tlp websites using
mod_vhost_alias.

Initiated a general cleanup request to tlp's with large /dist/ directories.

OSUOSL has upgraded our bandwidth cap to 50mbps inbound 100mbps outbound
(up from 10 and 50 resp).

Our expected (and paid for) arrays from Silicon Mechanics are delayed
3 weeks pending arrival of the requested Hitachi drives.

Upgraded our svn servers to deal with announced DoS vulnerability.

Spoke with one of our Dell reps about pricing inconsistencies in our new
service contract.  It should be resolved in the near future (in our favor).

Arranged for an updated quote from Technologent for service on our remaining
Sun gear.

16 Feb 2011 [Philip Gollucci / Jim]

Renewed service contract with Dell for 1 year regarding baldr
and our 2 PowerVault arrays.

Received a VAT invoice for ~ 900 EU for the array we recently
shipped to SARA.  Awaiting a wire from the treasurer for payment.

Specced a pair of D53J JBODs from Silicon Mechanics.  Awaiting
a wire from the treasurer for payment.

Started discussing next year's budget.

We now have a 3rd rack to use courtesy of OSUOSL.

Reworked the asf-do.pl script to overcome issues with opie.

Shut down the portals zone for security issues.

Deployed ckl (the "CloudKick logging tool") to all our FreeBSD hosts.
See http://www.apache.org/dev/ckl.html for details.

NERO, our network provider at OSUOSL, discovered an open HTTP proxy
on one of our hosts.  Upon further investigation we closed several
other outstanding security issues and removed root access from those
responsible for the poor setup.

Paul Davis continued his work on git hosting, setting up a mailer for
testing.

Daniel Shahaf was promoted to root@ for his outstanding work in several
areas.

Upgraded Jira to 4.2.x and added GreenHopper (supports agile
development) for all projects. We're also eating our own (fresher) dog
food since Jira now runs on the latest Tomcat 7 release.

19 Jan 2011 [Philip Gollucci / Jim]

Fixed all outstanding issues with the backup system.

Brought erebus online (one of the Dells purchased last month),
to serve as our main VSphere host.

Set up a test instance of JIRA 4 in preparation for the 3.x to 4.x
upgrade.

Instituted a password policy which locks accounts for 24 hours after
10 failed login attempts.  See

 https://blogs.apache.org/infra/entry/ldap_and_password_policy

for details.
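
The rule itself can be sketched in a few lines.  This is only an
illustration of "lock for 24 hours after 10 failed attempts"; the class and
method names are hypothetical, and the real policy is enforced in LDAP as
described in the blog post above, not by code like this.

```python
from datetime import datetime, timedelta

MAX_FAILURES = 10                     # failed attempts before locking
LOCK_DURATION = timedelta(hours=24)   # how long the lock lasts


class LockoutTracker:
    """Hypothetical in-memory tracker for failed logins per account."""

    def __init__(self):
        self._failures = {}      # account -> consecutive failure count
        self._locked_until = {}  # account -> time the lock expires

    def is_locked(self, account, now):
        until = self._locked_until.get(account)
        if until is None:
            return False
        if now >= until:
            # Lock has expired: clear it and reset the counter.
            del self._locked_until[account]
            self._failures[account] = 0
            return False
        return True

    def record_failure(self, account, now):
        if self.is_locked(account, now):
            return
        count = self._failures.get(account, 0) + 1
        self._failures[account] = count
        if count >= MAX_FAILURES:
            self._locked_until[account] = now + LOCK_DURATION

    def record_success(self, account, now):
        """Reset the counter on a good login; refuse if the account is locked."""
        if self.is_locked(account, now):
            return False
        self._failures[account] = 0
        return True
```

In this sketch a successful login resets the failure count, which is an
assumption; the report does not say how the counter is cleared.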


Brought the CMS to a feature-complete 1.x state.  It is now ready
for wide-scale adoption, starting with the incubator; see

 https://blogs.apache.org/infra/entry/the_asf_cms

for details.

Updated our account details with Dell.

Dealt with extended people.apache.org outage during the New Year
holiday.

Dealt with some wiki abuse reports from NERO regarding attachments.
As a result we have disabled the feature across the wiki farm.

Updated the LDAP scripts on people.apache.org to filter out redundant
entries in all "modify" operations.

RMA'd a failed drive back to Silicon Mechanics.

Promoted Daniel Shahaf to enjoy root karma on minotaur (people).

Daniel Shahaf set up our reverse ip zone master for our OSUOSL ip's
with OSUOSL's dns server slaving off that.

Specced a new JBOD array for service at about $7K.

Brought id.apache.org online (props to Ian Boston, Daniel Shahaf, and
Tony Stevenson); see

 https://blogs.apache.org/infra/entry/https_id_apache_org_new

for details.

Confluence upgraded to the latest 3.x version, courtesy of Gavin McDonald.

15 Dec 2010 [Philip Gollucci / Jim]

Ordered a pair of Dell r510's for $24.2K, slated to be used
for jails and database hosting.

Discussed JVM settings on the ofbiz vmware instance with the
ofbiz PMC.

Discussed previously offered HP donation of a pair of servers:
one for www.eu and one for jails hosting.  Agreed to accept
the donation but will not seek to accept hardware donations
in the future.

Had Y! replace a pair of failed disks in minerva (builds). As it's
a RAID 5 array we had to reinstall the OS.

Patched our JIRA installation in response to the recent security
advisory from Atlassian.

Gavin McDonald did an analysis of our backups to date, pointing
out specific backups that are not going thru cleanly.

Bruce Snyder successfully arranged for a VSphere license from VMware.

Discussed the fact that Justin Erenkrantz has moved away from the UCI
campus where our backup server is hosted.  The issue needs further
attention going forward as we need someone local who can change out
the tapes and put the used ones into a safe deposit box.

Paul Davis made significant progress in our pursuit of read-write
git hosting.

It was noted that several projects gave thanks to infra for their work this period. Well done!

17 Nov 2010 [Philip Gollucci / Jim]

We specced and ordered an Opteron-based Dell R515 machine
for virtual machine hosting; cost ~ $12.5K.  A FC card was
thrown in (for reuse of the decommissioned helios array)
at no additional cost.

Chris Rhodes was given root@ karma.

We held a team meeting on Monday Nov 1 during ApacheCon.
Highlights were posted to infrastructure-private@.

Daniel Shahaf was made a member of the Infrastructure Team.

www.apache.org was converted to the CMS with new templates
and stylesheets provided by Chris J. Davis.

HP has offered to supply our replacement servers for SARA.
Tony Stevenson is leading this conversation, as it could save
us roughly $30K from our budget.

Discussed the fact that Apple has EOL'd Xserves while we just
recently racked a donated pair of them.

Renewed the service warranty with Dell on baldr (jails) for 1y.

Plan drafted on what it would take for Git to be able to be
used as a project's primary source code repository. Currently
dependent on volunteers driving the effort.

Shane: yay new web site!

20 Oct 2010 [Philip Gollucci / Jim]

Helios (zones) has been decommissioned, replaced by FreeBSD jails
on sigyn.

5 new TLPs were processed: pig, hive, shiro, juddi, karaf.

We still need to purchase an Opteron-based box per the sponsorship
agreement with AMD.

Tony Stevenson fixed our busted zfs-snapshot-to-tape script on bia
(backups).

Upgraded all our Linux hosts to deal with the announced local root
exploit bug.

Bruce Snyder continues to pursue a VMWare VSphere installation for
us.

Coding work on the CMS has completed: for details see

http://www.staging.apache.org/dev/cms.html

We are planning to go "live" during Apachecon once the new templates
and stylesheets are available for www.apache.org.

We received a DMCA takedown notice regarding some content in the
logging and Hadoop wikis.  The PMCs have been notified but have not
reported back on their progress.

Specced an FC card to allow us to reuse the helios array, providing
sigyn with more storage space.  Estimated cost ~ $1000.

The AMD box has not yet been purchased.

AI: Jim to facilitate the purchase of the box.

22 Sep 2010 [Philip Gollucci / Jim]

Gavin McDonald is in the final phase of migrating all necessary Solaris
zones from helios to FreeBSD jails.

Hudson master has moved to a new machine (aegis) and begun using LDAP,
thanks to Tony Stevenson and Niklas Gustavsson.  Jukka Zitting notes
it may be time to start experimenting with running Hudson slaves on EC2.

Sander Temme is preparing eve (Xserve) for hudson and buildbot usage.

Daniel Shahaf patched the downloads script to deal with a potential
XSS vulnerability.  Unfortunately some stray code wound up in production
due to anakia deployment issues, which took the downloads script down
for several hours.

Sander Striker signed a Dell "letter of liability" for Dutch equipment
purchases for SARA.

Stefan Bodewig was given full infra karma due to his vmgump contributions.

Ulrich Staerk was given full infra karma based on his work on s.apache.org,
jira, and confluence.

We received 2 disks from Silicon Mechanics and replaced a failed disk in
eos(www).  The replaced disk will be shipped back to Silicon Mechanics
under RMA.

We ordered a disk array from Silicon Mechanics to be drop-shipped to Bart
van der Schans in the Netherlands for eventual deployment at SARA.  Cost
was ~$5700.

Started up a project for creating a custom CMS for Apache.  Initially it
will target www.apache.org, with something for people to review around
Apachecon in November.

Gavin McDonald proposed some new equipment purchases to build out our
VM infrastructure.

Upayavira completed the domain transfer for ofbiz.org.

Ari Maniatis is pursuing hosting an svn mirror in Australia.

Don Brown is pursuing the idea of getting support
for the Confluence auto-export plugin.  Go Don!

18 Aug 2010 [Philip Gollucci / Justin]

Daniel Shahaf, Dave Johnson, and Niklas Gustavsson were granted
infra-interest karma.

Norman Maurer brought new athena (mx1.us) online.

Sander Temme shipped the pair of Xserves in his possession to OSUOSL.
They have been racked and are currently being configured for use.

Ari Maniatis has been in touch with the University of Sydney for
the purpose of both hosting a svn mirror and to provide facilities
for an Apache conference/barcamp.

We looked into the idea of holding an infra-thon before Apachecon
in November but the timing didn't seem to work out.  We will probably
try again in early 2011.

Began the process of speccing replacement hosts for our EU gear hosted
in SARA.

The replacement host for minotaur has arrived, been racked in OSUOSL,
and is currently being set up by Philip Gollucci.  Note that we are running
out of current capacity (amps) in our racks.

Gavin McDonald reached out to pmcs with zones on helios to tell them about
our migration plans: replacing those zones with FreeBSD jails.  Most have
responded in a timely manner.

Odin has been decommissioned and the vmware instances it hosted for vmbuild
and vmgump have been transferred to nyx.

One of our disks in the brand new eos array was defective.  It's been
taken out of service and is to be shipped back to Silicon Mechanics for
replacement.  We have also ordered a pair of spare drives from them
as well.

We ordered an array from Silicon Mechanics to be shipped to Bart van der
Schans in the Netherlands for ~$6000.  The array is to be part of our
replacement plan for aurora (www.eu) in SARA.

We've moved the Hudson master to a new machine (aegis) and begun
using LDAP for Hudson, thanks to Tony Stevenson and others.

Shane asks when amps shortage will be critical. Philip replies that it's under control.

Approved by general consent.

21 Jul 2010 [Philip Gollucci / Justin]

Philip Gollucci performed a general cleanup of the filesystem
on minotaur (people).

Ok'd Jukka's plan to set up Gerrit for hosting an Apache Lab.

Infra was notified of a compromised gmail account potentially
hacked as a result of the jira hack on Apache.

Joe Schaefer was called for 2 weeks of jury duty and the team
capably picked up the slack in his absence.  Kudos in particular
to Gavin McDonald, Philip Gollucci, and Paul Querna.

The new replacement machine for eos is currently online and
serving traffic for www.apache.org and mail-archives.apache.org.
Migration of the moin wiki will be forthcoming soon.  Thanks to
Paul Querna and Philip Gollucci for doing the bulk of the setup.

The new replacement machine for athena is set up and should be
brought online as mx1.us shortly.  Thanks to Norman Maurer and
Philip Gollucci for doing the bulk of the setup.

Received a "paid invoice" notice from Network Depot for the
SonicWall.

Held some discussions between Mark Thomas and Philip Gollucci
about how to set up baldr, the machine destined to host
issues.apache.org.

Ordered a Dell R410 for ~$5000 to serve as a replacement for
minotaur (people).

Tony Stevenson made a few modifications to our LDAP tree to
better service Hudson and similar apps.

16 Jun 2010 [Philip Gollucci / Justin]

Sander Temme has been busy configuring the pair of Apple-
donated Xserves prior to shipment and installation at OSUOSL.

Gavin McDonald did some repair work on our backup systems.

Discussed the future of odin (vmware) while Mark Thomas
replaced a failed disk in it.

Mark Thomas was promoted to root@.

Brought a URI-shortening service online: http://s.apache.org/details
which takes advantage of LDAP + the Thawte-supplied wildcard cert.

No progress was made in bringing up the purchased replacement
host for eos.

The new machine Aegis was brought into service to replace Ceres as
the Buildbot Master; Ceres continues to be a Buildbot Slave.
Work is underway to move the Hudson.zones Master to Aegis as well.

Sebastian Bazley has taken up the charge to address a few key
crons that require access to private svn urls.

19 May 2010

Replaced an S300 software RAID card with a PERC 6i card in
aegis (builds).

Enforced a hard May 1, 2010 deadline for all admins to adopt
OPIE on all Linux and FreeBSD hosts at Apache.

Gavin McDonald upgraded Confluence to 3.2 - the latest available
version.  Dan Kulp was kind enough to patch the autoexport plugin
this time round, but we need to take another serious look at phasing
out Confluence as a CMS, perhaps replacing it with Day's CQ5, in the
near future.

Initiated periodic password cracking program for our most sensitive
passwords, particularly LDAP passwords.  Early results identified
some 60 accounts vulnerable to dictionary-style attacks, and those
users have been contacted.  Also notable was the identification of
FreeBSD crypt as being a superior storage format for hashed passwords
as opposed to SSHA, so we are in the process of phasing out SSHA
for LDAP passwords.
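
As a rough illustration of the dictionary-style audit described above, the sketch below checks a word list against stored {SSHA} values, assuming the usual OpenLDAP layout of base64(SHA-1(password + salt) + salt).  The account names, word list, and helper names are hypothetical and are not taken from Infra's actual tooling.  The report does not state why FreeBSD crypt was preferred; one commonly cited weakness of SSHA is that it is a single round of salted SHA-1, so each guess is very cheap to test.

```python
import base64
import hashlib
import hmac
import os
from typing import Dict, List, Optional

PREFIX = "{SSHA}"


def make_ssha(password: str, salt: Optional[bytes] = None) -> str:
    """Build an {SSHA} value: base64(SHA-1(password + salt) + salt)."""
    salt = os.urandom(4) if salt is None else salt
    digest = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return PREFIX + base64.b64encode(digest + salt).decode("ascii")


def check_ssha(password: str, stored: str) -> bool:
    """Return True if the password matches the stored {SSHA} value."""
    raw = base64.b64decode(stored[len(PREFIX):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes long
    candidate = hashlib.sha1(password.encode("utf-8") + salt).digest()
    return hmac.compare_digest(candidate, digest)


def audit(accounts: Dict[str, str], wordlist: List[str]) -> List[str]:
    """List the accounts whose password appears in the word list."""
    return sorted(
        user
        for user, stored in accounts.items()
        if any(check_ssha(word, stored) for word in wordlist)
    )


if __name__ == "__main__":
    # Hypothetical accounts for illustration only.
    accounts = {
        "alice": make_ssha("password"),
        "bob": make_ssha("x9!Tq#veryLongRandom"),
    }
    print(audit(accounts, ["letmein", "password", "apache"]))
```

An audit of this kind only reports which accounts are exposed; contacting the affected users, as the report describes, remains a manual follow-up.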

We are in the process of compiling a list of accounts with security
issues that are no longer reachable via their apache.org email address.
Those will be the first group of accounts we close out.

We have cleaned up the root@ alias addresses and synced them with
committee-info.txt.  Notable changes were the removal of Roy Fielding,
Ted Husted, Joshua Slive, and Erik Abele, and the addition of Gavin
McDonald, Tony Stevenson, and Norman Maurer.

Due to port restrictions and lack of console access to Y! machines, the
Buildbot master was moved to the newly brought online 'aegis' builds
machine. The old host 'ceres' remains as a buildbot slave. Hudson master
is due to move across from Hudson.zones shortly.

Noirin Shirley was voted onto the Infrastructure Team for her editorial
work on the infra blog.

21 Apr 2010 [Philip Gollucci / Justin]

Norman Maurer upgraded several Solaris machines to the latest
available version.

Purchased 48GB RAM from Crucial slated for installation in
the upcoming eos (websites) replacement host.

Had OSUOSL ship Ken Coar the failed fireswamp (apachecon) host.

Installed new Dell 48-port switches and moved all public DRACs
at OSUOSL to the private network.

Ordered a Dell R410 (slated to replace eos) with external SAS
card for $3800.

Ordered a 12-disk JBOD from Silicon Mechanics for $6400.

Ordered a Dell R210 (slated to replace athena) for $2200.

Ordered a SonicWall SRA 2000 for $2300 to replace crappy VPN device.

Took Paul Querna up on his offer for Cloudkick service for
host/service stats.

Ruediger Pluem was granted full infrastructure karma.

Working out details of an Apple donation of a pair of Xserves.

Brutus (issues, cwiki) got hacked.  The details are available
at https://blogs.apache.org/infra/entry/apache_org_04_09_2010

17 Mar 2010 [Philip Gollucci / Justin]

Philip Gollucci signed the annual service contract with Sun/Technologent
for ~$2K.

2 SSDs installed into eris(svn) to boost performance.  Between our EU and
US svn servers we currently handle over 6M hits / day.

RAM ($1200) installed into eris(svn) and brutus(jira,bugzilla) to boost
performance.

Website traffic to our TLPs and www.apache.org is hovering around 10M hits a day.

Spam traffic continues to fall: we are currently seeing only about 600K
connections per day, down from its peak of 1.5 M connections a day in 2006.

Philip Gollucci worked some magic and has upgraded all of our FreeBSD boxes
to 8.0-stable.  The old NGROUPS_MAX problem that previously limited users to
16 unix groups is a thing of the past.

Discussed and created a budget for FY 2010 worth ~$250K.

In discussions to purchase an Xserve from Apple for ~$6K.

Discussed an offer from a third party to host a virtual machine for us.
Ultimately the offer was declined.

Discussed plans for migrating Solaris zones to FreeBSD jails.

Aurora (websites) is down for an extended period of time until we can determine
whether or not to replace it immediately or have the machine serviced by a
Sun tech.

Gavin McDonald specced another Dell for use as a build farm server for ~$6K.

Purchased a pair of Dell 5448 48-port managed switches for ~$1600.

Brad Davis of FreeBSD infrastructure subscribed to infra-private@.

Aristedes Maniatis was granted infrastructure-interest karma.

Geir asked about Technologent invoice, Phil to follow up.

SVN performance improvements are much appreciated!

17 Feb 2010 [Philip Gollucci / Justin]

Turns out the SAS card we ordered for eris (svn) is totally unsupported
by FreeBSD.  We have sent it back to Newegg and ordered a known-to-be-
supported Dell card instead, with additional cables from Provantage,
at a total cost of ~$350.

After several months of trouble with the nyx (vm's) raid card / array,
upgrading "everything" seems to have resolved the random disk failures.

Jukka Zitting has installed gerrit on git.apache.org for preliminary
(semi-authorized) testing of native server-side git support.

Mass mailed hundreds of committers about insecure / oversized items in their
home directory.  It is not encouraging to see the same people on the lists
month-after-month.

Philip Gollucci negotiated new terms for the Sun support contract (to be
signed later this month).

Tony Stevenson and Chris Rhodes engineered the successful migration of Unix
and svn group data into LDAP.

Board agrees with the approach of temporarily locking out committers who aren't paying attention to security notices.

20 Jan 2010 [Philip Gollucci / Justin]

De-racked 4 decommissioned machines: freyr, freyja, idunn, fireswamp.

Introduced ckl, a communication tool for distributed sysadmin teams,
into our workflow.

Tony Stevenson acquired a 2 year wildcard SSL certificate from Thawte.
All top-level project websites, including www.apache.org, are now
available over https.

Sketched out some preliminary plans for migrating the zones on helios
to jails on sigyn.

Replaced a failed disk in hermes(mail).

Ordered a pair of SSD's and a JBOD enclosure to house them from Silicon
Mechanics for ~$2400.  The order is expected to ship before Feb 1 and
will be used to beef up performance on eris (svn).

Ordered a corresponding PCIe 2xSAS card for eris (svn) to communicate with
the SSD's for ~$400.

Dan Kulp was granted root on brutus to assist with confluence admin.

16 Dec 2009 [Philip Gollucci / Justin]

Sander Striker ordered a disk replacement for nike (mx1.eu) and Bart
van der Schans installed it.

Began testing a searchable interface for private mail archives courtesy
of Chris Rhodes.

Started sending out periodic notifications to users with oversized home
directories, insecure permissions or private keys.

Improved our DNS configuration with input from Surfnet.

Ordered 6 SCSI drives to serve as replacements for various failed drives.

Tony Stevenson specced an HP machine slated to replace the aging minotaur
(people).

Tony Stevenson expanded our LDAP footprint to now be usable for logins
with all subversion repositories.

In contact with Sun to replace a failed controller in helios's array.

Philip Gollucci upgraded loki to FreeBSD 8.0.

Philip Gollucci patched FreeBSD on minotaur(people) to deal with a few
security advisories.

Martin Cooper and Davanum Srinivas dealt with a mass influx of spam
accounts into Confluence.

18 Nov 2009 [Philip M. Gollucci / Justin]

Philip Gollucci was appointed VP Infra, replacing Paul Querna.

Brian Fox was added to the Infrastructure Team.

A podling's distribution directory was compromised due to lax
permissions (our hacker from August had installed a backdoor
script in a user's public_html directory).  The offending
material was removed prior to being distributed to the mirrors.
A general cleanup of home-dirs with lax permissions was also
executed.

One of our virtual hosts was hacked, most likely due to a poor
choice of root passwords (and leaving remote-root-logins enabled).
The virtual host was summarily nuked as a result.

There was some confusion surrounding the creation of a DNS entry
for the upcoming Asia Roadshow event.

A general question was raised by Philip regarding how much Infra
has spent of its budget so far.  Advice from the Treasurer would be helpful.

Sander Striker is pursuing a replacement for clarus.

Infra held a face-to-face meeting at ApacheCon.  Notes were posted
to the infra-private list by Upayavira.

Tony Stevenson has started tackling LDAP again.

Chris Rhodes has been testing a service to provide web access to our private
archives to members.

The PDFBox TLP was created.

Subversion's repo has been migrated to the ASF repo.

Chris J. Davis began hacking up a new website for Apache.

Gavin McDonald continued to extend our buildbot offerings.

We seem to have unresolved issues with the tape library on bia (backups).

We're in discussions with HP regarding a hardware donation/loan.

We're in discussions with Thawte regarding SSL certificates.

Two disks have failed in nyx's (virtual machines) array.  Tony Stevenson
has partially addressed the situation, but we need to purchase replacement
drives ASAP.

One disk has failed in eos's (websites, wiki) array.  We're waiting for
a replacement drive to be purchased.

We're stalled on what to do about switch replacements.  The current (24 port)
switches are essentially full, and we need to purchase a switch or two with
more ports to replace them.

21 Oct 2009 [Paul Querna / Justin]

Moin wiki post-upgrade issues resolved.

Ruediger Pluem developed a patch for httpd to mitigate the ddos
issue plaguing brutus (jira,bugzilla).

Norman Maurer upgraded eos (wiki, websites) to Solaris 10 u8.

Issues with the sync scripts were resolved by Tony Stevenson.

Athena (mx1.us) was brought back online by Philip Gollucci after
osuosl replaced the power supply.

Philip Gollucci developed an automated system for managing crons.

Paul Querna purchased a Dell Poweredge R610 for ~$4400 slated
as a replacement for helios (zones).

Justin Erenkrantz dealt with some IO issues on helios by shuffling
a few zones around.

Philip Gollucci cleaned up a few unused root accounts.

(Temporarily) suspended a committer's privileges pursuant to a board
request.

General interest was expressed in participating in a cross-foundation
infrastructure list hosted by osuosl.

Gavin McDonald continued his work on the buildbot-based farm.

Gavin McDonald spec'ed a Dell machine to potentially be used as an
Australian svn mirror.

Philip Gollucci upgraded viewvc to 1.1.2.

Paul Querna ordered a fiber channel card for ~$450 to complement the
ordered Dell mentioned above.

23 Sep 2009 [Paul Querna / Justin]

 o Hacking Incident
 = First major security incident on our infrastructure since 2001[1].
There are always possible things to change, but we handled it well,
and have rebounded with one of the most active months in recent Infra
memory.
 = Initial Report:
<https://blogs.apache.org/infra/entry/apache_org_downtime_initial_report>
 = Full Report:
<https://blogs.apache.org/infra/entry/apache_org_downtime_report>

[1] http://www.apache.org/info/20010519-hack.html

o SARA/SURFnet hardware moving to new location inside same data
center. bvds heading up the local team on the ground.

o Added Gavin as a part time contractor.

o SvnPubSub developed
<https://svn.apache.org/repos/infra/infrastructure/trunk/projects/svnpubsub/svnpubsub.py>
 = Notifies services of changes to the Subversion Repositories
 - Twitter Bot Online <http://twitter.com/asfcommits>
 - Testing SvnWcSub, to keep a working copy in sync with a master
repository.  Will replace /dist/ and most websites distribution in the
long run, which is currently being done with rsync over SSH.

o mod_allowmethods developed
<https://svn.apache.org/repos/infra/infrastructure/trunk/projects/mod_allowmethods/mod_allowmethods.c>
 = Disabled all non-GET requests on most VHosts for *.apache.org

o mod_asf_mirrorcgi developed
<https://svn.apache.org/repos/infra/infrastructure/trunk/projects/mod_asf_mirrorcgi/mod_asf_mirrorcgi.c>
 = Hack to map our hundreds of identical download.cgi scripts to
invoke the same CGI directly.

o Disabled CGI support on most vhosts for *.apache.org

o MoinMoin wiki upgraded from 1.3.x to 1.8.x

o FastCGI via mod_fcgid is now used for the wiki and mirrors.cgi

o Nyx set up with VMware to host various VMs.

o Enabled OPIE on Brutus & Nyx.

o Dealt with ZFS issues on minotaur(people). Had to rebuild the array
after a disk died.

o Replaced hyperreal.org with no-ip.com for one of our Slave DNS Servers.

o Coordinated a security fix for Bugzilla.  We were contacted ahead of
time by Bugzilla developers, and given a patch to apply before they
made a public disclosure.  Mark has since updated us to their new
release.

o Requested new Solaris 10 OS subscription keys from contact at Sun.

o thor brought online (to host svn-dist and search services)

o eos, bia, thor, and aurora upgraded to Solaris 10 u7.

o New sshd_config using SSH keys stored in SVN for infra members has
been deployed on most machines.

o In the process of removing unneeded access & sudo to several machines
(hermes, brutus, minotaur)

o promoted norman,pctony,gmcdonald to root@ on all fbsd boxes

o zfs will be declared production ready in 8.0-RELEASE when it comes out

o Minotaur(people)
 = Updated to 7-stable
 = Updated ports:
  - hpn-ssh - is now a port
 = Updated people.apache.org, www.apache.org
   httpd 2.2.11 -> 2.2.13
 = converted to dns/bind96
 = set up no-ip as dns slaves
 = started ipfw->pf conversion

o hermes(smtp)
 = Updated to 7-stable

o hercules(mx2.us)
 = Updated to 7-stable
 = setup

o eris(svn.us)
 = Updated to 7-stable
 = updated ports
  - serf is now a port
 = updated svn 1.6.1 -> 1.6.5
 = updated httpd 2.2.11 -> 2.2.13
 = updated svnmailer
 = attempted viewvc update
  - fixed viewvc file contents bug

o harmonia(svn.eu)
 = Updated to 7-stable
 = updated ports
  - serf is now a port
 = updated svn 1.6.1 -> 1.6.5
 = updated httpd 2.2.11 -> 2.2.13
 = updated svnmailer
 = attempted viewvc update
  - fixed viewvc file contents bug

o athena(mx1.us)
 = Updated to 7-stable
 = replaced dead disk ad4 [osuosl]
 = replaced doa disk again [osuosl]
 = updated httpd 2.2.11 -> 2.2.13
 = updated ports

o nike(mx1.eu)
 = Updated to 7-stable
 = updated httpd 2.2.11 -> 2.2.13
 = updated ports
 = updated lom [osuosl]

o loki(tb,ftp,cold spare)
 = Updated to 7-stable
 = updated ports

19 Aug 2009 [Paul Querna / Justin]

Sander Striker is still looking into purchasing a pair of replacement
drives for aurora (.eu website mirror).

Installed the recently purchased r710 Dell box as nyx. Tony Stevenson
set up VMware ESXi on nyx and created a few virtual hosts, one XP based.

We're looking to return the four IBM x345's on loan and awaiting
feedback from IBM (contact point Sam) on where the machines should
be shipped.

Fireswamp's (apachecon) power supply failed.  The machine is too
old to try and service with replacement parts; we will create a
VMware host on nyx, restored from fireswamp backups, as a replacement
soonish.  In the meantime we're redirecting all www.apachecon.com
pages to us.apachecon.com.

Attempts to round up volunteers for a softball game against the board
haven't met with much success, largely due to the short (and as yet
unknown) timeframe of the board retreat.  We may need to reschedule
the game to coincide with some other Apache related get-together sometime
next year.

Dealt with some unexpected/unplanned networking issues on our build
servers in the Yahoo! data center.

Philip Gollucci replaced the over-capacity gmirror-based /x1 array on
people.apache.org with a shiny new raidz2 zfs-based array.  After a week
or so of random zfs failures, things seem to have stabilized since upgrading
the host (minotaur) to 7.2-STABLE.

Tony Stevenson sent out another request to PMC chairs to ensure the
asf-authorization file is compatible with our plans to migrate group
data to LDAP (aka phase 2 of the LDAP migration).  Compliance is still
a mixed bag, and we will likely just do the deed for the non-compliant
PMCs.

In the early discussion phase for a potential new Microsoft-based build
farm.

Gavin McDonald notes Buildbot can now automatically deploy snapshots to
repository.apache.org via nexus or to a custom download page.  (See
http://ci.apache.org/projects/ofbiz/snapshots/ as a working example.)
Reminder that Buildbot is geared up to produce RAT reports for any project
that wants it.

Paul Querna notes problems with the Confluence auto-export referencing
Javascript and CSS hosted on brutus caused some downtime for other services
on brutus (JIRA, Bugzilla, etc).

15 Jul 2009 [Paul Querna / Justin]

Justin Mason arranged for a free license to the Spamhaus DNSBL
service.

Tony Stevenson and Chris Rhodes began testing phase 2 of LDAP
service on harmonia (svn mirror).

We lost another drive in aurora (.eu website mirror).  Sander
Striker is investigating a pair of replacement drives.

Gavin McDonald was voted into the root@ club for his interest
in FreeBSD maintenance.

Purchased an r710 Dell box for VMWare hosting for ~$5K.

Purchased RAM, a SCSI card, and additional drives from NewEgg
for ~$1100.

Mads Toftum upgraded aurora (.eu website mirror) to Solaris 10 5/09.

Don Brown upgraded our Confluence installation to 2.10.3.

Gavin McDonald is organizing the return of the four IBM x345s
on loan.

We continue to experience availability issues with eris (svn)
due to zfs issues with FreeBSD.

VMWare Workstation on odin was upgraded to 6.5.

17 Jun 2009 [Paul Querna / Justin]

Tony Stevenson completed phase 1 of the LDAP migration, migrating user
accounts on people.apache.org into LDAP.

Sander Striker promised to someday order a replacement disk for aurora
(websites) and have it shipped to Bart van der Schans in the Netherlands.

The SAS cable we RMA'd back to Provantage was returned back to us as an
invalid RMA.  We have procured a UPS shipping label from Provantage and
are attempting to resend it.

Infrastructure has made a request to PMC chairs to help us with Phase 2 of
the LDAP migration: bringing groups into LDAP.  The majority have complied,
while a large number of PMCs have yet to do so.

IPv6 support was disabled until we are better positioned to be able to
monitor and maintain it.

Henk Penning continued to keep a careful eye on the mirroring system.

Brian Fox continued his support for the Nexus installation at
repository.apache.org.

Mark Thomas upgraded our Bugzilla instances to the latest version.

Chris Rhodes was voted in as a new Infrastructure committer.

Gavin McDonald continued to enhance our buildbot service at ci.apache.org.

20 May 2009 [Paul Querna / Justin]

Shipped errant SAS cable back to Provantage.

16 new backup tapes ordered, charged to ASF credit card.

Began organizing volunteers for moving our gear in SARA. We still
need to purchase a disk replacement for aurora (websites).

Philip Gollucci upgraded all of our FreeBSD servers to 7.2-RELEASE,
based on a central machine, tb.apache.org, for compiling and pushing
out the software.

Paul Querna upgraded subversion to 1.6, splitting out the remaining
private portions of the "asf" repository into a new "infra" repository.
Note to officers: the new location containing the asf-authorization file is
https://svn.apache.org/repos/infra/infrastructure/trunk/subversion/authorization

Sent an opt-out letter for Phorm scanning on apache.org and related domains.

Discussed the lack of progress with respect to upgrading the Confluence
auto-export plugin to be compatible with recent releases of Confluence.
Adaptavist again claims to be nearly finished with the work (ETA 2 months),
but if the situation hasn't been resolved in that timeframe we will need to
pursue other options, including migrating all of cwiki.apache.org to the
latest version of moin-moin.

Updated the Release FAQ, with input from several contributors.

15 Apr 2009 [Paul Querna / Justin]

Submitted a budget to the budget committee.

Purchased a 1-year extension to the Technologent service contract for
our Sun gear at $4K.

Set up a blogging infrastructure for projects to use at blogs.apache.org.

Henri Yandell merged the Click, Cayenne, and Roller jiras into the main
jira.

Our Dotster account was hacked (again); as a result, our DNS glue records
for apache.org were changed for a brief period.  Discussions
to change our registrar are ongoing.

Migrated our core mail server (hermes) from our last IBM x345 in service
to one of our new Dell 2950's.  The old gear will be deracked and sent back
to IBM along with the other x345's.

Set up Geographic DNS for svn.apache.org to distribute traffic between
our master server (eris) and our European mirror (harmonia).

Purchased a Linksys RV042 VPN device for $138.

Work on the new git.apache.org continues, principally being performed
by Jukka Zitting and Grzegorz Kossakowski.

Work on LDAP at the ASF continues, being driven by Tony Stevenson.

Work on the Buildbot installation continues, being driven by Gavin McDonald.
We've set up a new mailing list for build services at builds@apache.org.

Norman Maurer upgraded our backup server (bia) to Solaris 10u6.

Lost a disk in both aurora (websites) and minotaur (people).

18 Mar 2009 [Paul Querna / Justin]

Replaced a failed disk in eris (svn).

Discussed what to do about disused accounts on people.apache.org.
No actions taken at this time.

Replaced a failed disk in eos (websites).  Norman Maurer rebuilt
the array to be based on raidz2 instead of raidz1, and upgraded
the operating system to Solaris 10u6 + patches.

Gavin McDonald convinced Yahoo! to open up the buildbot server
port on ceres (builds).

Discussed a budget - the general consensus is that the infra
budget should be $150K-$160K per year, but folks on the budget
committee are looking for a more detailed breakdown of the
hardware portion.

Discussed an infra meetup between various open source
organizations, with the intent to fund travel should it come to
fruition.

Norman Maurer convinced the SpamAssassin PMC to move their IO-
intensive jobs to their zone on odyne.

Shipped the errant SAS cable back to provantage.com.

git.zones.apache.org was set up on odyne for Jukka Zitting to work
on.

Coordinated the move of planetapache.org to planet.apache.org/committers,
with the help of several member volunteers.

Henri Yandell carefully upgraded all of our Jira installations to
3.13.2.

Sebastian Bazley continued his work validating foundation records.

Henk Penning continued his work handling mirror requests.

Justin confirms that the $150K-$160K figure includes staffing.

18 Feb 2009 [Paul Querna / Justin]

Philip Gollucci successfully upgraded people.apache.org to
FreeBSD 7.1.

Paul Querna purchased cables, SAS card, and an array for
replacing the existing array on people.apache.org, for ~ $4500.
Unfortunately the cable provider sent us the wrong cable, so
we had to order a replacement.

Paul Querna and Sander Striker coordinated with our datacenter
providers to enable IPv6 routing for our websites.

Tony Stevenson set up backup services for the Apachecon hosts.

Paul Querna restructured the apache.org DNS zone to be generated from
a script.
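
The report gives no details of the script itself.  As a minimal sketch of the general idea, assuming the records live in a simple list and the zone file is rendered from it, something like the following would work.  The origin, record names, SOA timers, and addresses are all hypothetical placeholders (example.org and the 192.0.2.0/24 documentation range), not the real apache.org data.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def render_zone(origin: str, serial: int, records: List[Record],
                ttl: int = 3600) -> str:
    """Render a BIND-style zone file from a list of records."""
    lines = [
        f"$ORIGIN {origin}",
        f"$TTL {ttl}",
        # SOA fields: primary NS, contact, serial, refresh, retry,
        # expire, negative-caching TTL (timer values are placeholders).
        f"@ IN SOA ns1.{origin} hostmaster.{origin} "
        f"( {serial} 7200 3600 1209600 3600 )",
    ]
    for name, rtype, value in records:
        lines.append(f"{name} IN {rtype} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = [
        ("@", "NS", "ns1.example.org."),
        ("ns1", "A", "192.0.2.1"),
        ("www", "A", "192.0.2.10"),
        ("svn", "CNAME", "www"),
    ]
    print(render_zone("example.org.", 2009021801, records))
```

Generating the zone this way keeps the record list as the single source of truth, and the serial number can be bumped in one place on every regeneration.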

Gavin McDonald continues to work on a buildbot installation on 2
of our new Yahoo! hosts.

Gavin McDonald pushes for a budget for infrastructure by the end of
February.

Brian Fox set up a Nexus instance on the repository.zones.apache.org
zone to facilitate moving some of our maven infrastructure to
repository.apache.org.

Paul Querna renewed the myfaces.com DNS for 2 years.

Confluence is still mired in the futility of hoping someone will fix
the autoexport plugin, despite many promises from Atlassian.

Mads Toftum and Norman Maurer are discussing how best to provision
thor for the services we will place on it.

Eos (websites) lost a ZFS disk which took it out of commission for a
few days.  Services were moved to aurora to prevent a significant
outage.

Infrabot was significantly enhanced, adding many new collaboration
features.  It is worth noting that infrabot is now on twitter at
http://twitter.com/infrabot , which we expect to use for service
announcements and outreach to folks too busy to follow the
infrastructure mailing lists.

21 Jan 2009 [Paul Querna / Justin]

Discussions about what to do with thor now that the new
disks are installed continue.

The new Yahoo! machines are online and being configured
by Nigel Daley and Gavin McDonald with build services.

Henri Yandell continues to wrestle with a Jira upgrade.

Purchased a new array, cables, and SAS card for minotaur
(aka people.apache.org) for $4700.

Brought loki (hot-spare) online with FreeBSD 7.1.  We're
planning to do the same for new hermes (mail) next month.

Tony Stevenson continues to lead the LDAP deployment discussions
on infrastructure-dev@.

Two TLP migrations, buildr and camel, were completed.

Roughly 2 dozen new accounts were created.

Philip Gollucci is planning to upgrade minotaur (people.apache.org)
to FreeBSD 7.1 in preparation for the arrival of the aforementioned
array.

Paul clarified that the original intent for Thor was as a build server, but the new build machines from Y! turned out to be a better match due to bandwidth constraints and location.

17 Dec 2008 [Paul Querna / Justin]

6 SAS disks were purchased for use in thor: cost $2000.
Discussions for repurposing thor as a database/blog server
once the disks are installed are ongoing.

4 Yahoo! build-farm machines have been made available to the
infrastructure team for configuration. Discussions for
requesting another hired sysadmin on a part-time basis
to provide centralized build services are underway.

Henri Yandell continues to wrestle with a jira upgrade.

Bart van der Schans visited the SARA colo again to help
with maintenance on harmonia (svn mirror) and nike (mail).

Paul Querna continues work on the new www.apache.org/dev/stats pages.

Sander Striker and Paul Querna are pursuing an IPv6 allocation
for apache hosts.

Adaptavist has offered to continue James Dumay's work on the
autoexport plugin (which still prevents us from upgrading Confluence).

Atlassian has offered to host an SVN mirror for providing more
Fisheye sites.

Renewed the spamassassin.org domain for an additional 2 years.

Philip Gollucci somehow figured out how to upgrade our two frontline
mailservers, nike and athena, to FreeBSD-7-STABLE.

Renewed the SSL certificates for svn.apache.org and issues.apache.org.

Upgraded all ~800 apache mailing lists to support the emerging SRS
and BATV specifications.

Tony Stevenson continues to pursue LDAP deployment at the ASF.

Chris J. Davis was brought on as a new infrastructure committer.

Jukka Zitting was added to the infrastructure team for his work
on git mirrors.

The attic pmc infrastructure was set up.

Three TLP migrations - abdera, qpid, and couchdb - were completed.

We discussed the importance of establishing a budget before we can evaluate the request for a sysadmin.

19 Nov 2008 [Paul Querna / Justin]

Yahoo! talks about a build farm are proceeding at our normal slow and steady
pace.

T-shirts were distributed to team members.

Held a face-to-face meeting at Apachecon with irc logging.

Henk Penning and Gavin McDonald led the effort to clean out the mirror system,
successfully purging over 6GB of redundant artifacts.

Made progress with bringing Confluence up-to-date, by first recognizing its
importance to current operations and then working with James Dumay on irc to
bring the autoexporter plugin up to date.  We expect to make more progress
once James' patch has been tested and incorporated into the autoexporter tree.

No progress has been made in replacing hermes (mail). We're looking forward to
tackling that when FreeBSD releases 7.1, with expected improvements in its zfs
support.

Norman Maurer is still waiting for us to order disk replacements for thor
(builds).

Took a few failed cracks at upgrading nike (mail) to FreeBSD 7 with Bart van
der Schans' on-site help.  Will continue tackling this problem before pursuing
FreeBSD upgrades to other x2200's in service.

Offered Chris Davis committership under infrastructure's purview. He has
accepted the offer.

Tony Stevenson-led investigations into deploying LDAP at the ASF have gained
some speed.  We've talked with the Apache Directory project a bit about
potentially using their software for this purpose.

Paul Querna provided the Syracuse researchers with a filtered dump of our
public svn tree, and will be coordinating with them to determine the best
way of keeping their soon-to-be-published copy of our repo current. We also
pointed them at the rsync location of our raw mail archives.

Justin Erenkrantz purchased a few new SSL certs to replace those that were
scheduled to expire next month.

Paul notes that we may need another part time administrator next year. Don't feel comfortable tasking that to our existing administrators as this requires root.

The infrastructure team is looking into Covalent's tools as time permits.

15 Oct 2008 [Paul Querna / Justin]

Yahoo! talks about a build farm are ongoing.

No progress has been made in replacing hermes (mail).
We're looking forward to tackling that when FreeBSD
releases 7.1, with expected improvements in its zfs support.

Fail2Ban, a tool for guarding against ssh scans, has been
implemented on people.apache.org by Tony Stevenson.

Some issues with Confluence, particularly its license,
have arisen. We are currently stuck between a rock
and a hard place in that we cannot upgrade it without
breaking the autoexport plugin, which is a core feature
of the service.

Disk replacements for thor (builds) have not been ordered,
pending a callback from CDW.

Mark Thomas was granted infra karma for his work on Bugzilla.

Discussions about creating a vmware instance for Windows
are ongoing.

Our svn servers are now servicing over 2M requests per day,
which is a doubling of activity over the past year.

Our automated checks against wiki spam have blocked roughly
100K attempts over the past year.

Jukka Zitting continues his impressive work experimenting
with git at Apache on infrastructure-dev@.

17 Sep 2008 [Paul Querna / Justin]

We've fallen a bit behind our machine upgrade schedule due mainly to
persistent concerns over the stability of FreeBSD on eris (svn).  The
machines that need to be brought online are loki (a cold spare) and hermes
(a drop-in replacement for the existing x345 which serves mail).

Dealing with intermittent problems with the build process on the hudson zone.

Transferred the vmsa vmware instance to a zone on odyne for performance
reasons.

Created roughly 2 dozen new committer accounts.

Sebastian Bazley continues his work rationalizing foundation records and
authoring supporting scripts.

An issue came up regarding the ability, or lack thereof, of purging sensitive
data from the svn repository.  No action was taken at this time.

Yahoo! has been in touch with us to resume talks about a build farm donation.

Paul continues to work the issue on the incompatible drive trays.

20 Aug 2008 [Paul Querna / Justin]

We have a functional backup system in place, complete
with backups of select files within user home dirs,
thanks to Tony Stevenson, Gavin McDonald, Norman
Maurer, and Roy T. Fielding.

Philip Gollucci was flown down to Fort Lauderdale
to help set up the new colo site at TRACI.net.

The two Geronimo build machines, selene and phoebe,
were set up and handed off to the Geronimo PMC.

An experimental LDAP zone was set up to pursue
the idea of deploying LDAP in some capacity at the ASF.

Purchased a SCSI card for thor (build zones). Unfortunately
the existing array failed miserably (both PSUs died).
Currently pursuing a different path of installing drives
in thor (said drives were also purchased from Sun for ~$2000,
but need to be replaced due to incompatible drive trays.)

Wendy Smoak was granted infrastructure karma for her
work on the ASF's maven repository.

The maven snapshot repository was purged of all files older
than 30 days, which created some ripples within the community.
At its largest it was over 90GB, which means it contained
more bits than archive.apache.org. It currently stands
at 21GB.

Henning will work with Geronimo folks to start load monitoring on the new machines provided to their projects.

16 Jul 2008 [Justin Erenkrantz]

Purchased two Dell 1950s for $10K.  The machines have been shipped
to the sysadmin and will be deployed shortly for Geronimo to use
for TCK testing.  We investigated EC2 as an alternative, but found
it wasn't cost-effective for the needs of the Geronimo PMC.

Entered into a monthly agreement with traci.net for ~$500 / month
for colocation services in Florida.

Ordered a switch and miscellaneous cables for setting up the colo.

Helped some new zone admins set up shop.

Set up the Sun t5220 as thor, which will be used for build systems
once we have the disk situation sorted out.

Work continues on setting up bia (backups), driven by Tony Stevenson,
Gavin McDonald, and Norman Maurer.

Set up odyne in SARA, which will be used as a zone host.

Made progress with Jason van Zyl regarding the adoption of maven.org and
the central repo machine by the ASF.

25 Jun 2008

Appointment of Infrastructure Committee Chair

 WHEREAS, the Board of Directors heretofore has charged the
 President with the responsibility of overseeing the activities
 of the ad-hoc Infrastructure Committee using the President's
 existing authority to enter into contracts and expend foundation
 funds for infrastructure, and

 WHEREAS, the Board of Directors recognizes Paul Querna as the
 appropriate individual to chair the infrastructure committee,
 with respect to executing the board approved infrastructure plan
 binding the Foundation to infrastructure contracts and associated
 financial obligations,

 NOW, THEREFORE, BE IT RESOLVED, that Paul Querna be and hereby is
 appointed to the office of Vice President, Apache Infrastructure,
 to serve in accordance with and subject to the direction of the
 President and the Bylaws of the Foundation until death, resignation,
 retirement, removal or disqualification, or until a successor is
 appointed.

 Special Order 7H, Appointment of Infrastructure Committee Chair,
 was approved by Unanimous Vote of the directors present.

25 Jun 2008 [Justin Erenkrantz]

Addressed the board's suggestion for more descriptive
service names by updating our nagios config and
improving the documentation in dev/machines.html.

Work continues on setting up bia (backups), driven by
Norman Maurer, Tony Stevenson and Gavin McDonald.

Gaea (zones) was determined to have BIOS issues.  As
mentioned in the previous report, it was shutting itself
down without warning.  A ticket was filed with Sun, and
remains open.  The problem hasn't occurred since upgrading
the BIOS, but we are carefully monitoring the machine.

Henk Penning continues to keep a close eye on the operation
of the rsync mirrors and their signed contents.

We purchased a NIC card and 8 GB of RAM.  The NIC card is
insurance against a failing NIC in eris (svn), and the 8 GB
of RAM was divided between gaea and hyperion (zones).

Several zones were created, the tuscany TLP was migrated,
and roughly 30 new accounts were created.

We are working with OSUOSL to get new hermes (mail) online and
a new Sun 5220 racked.

Eos and aurora (websites) had system upgrades performed by Mads
Toftum and Norman Maurer.

Norman Maurer was granted root karma on people.apache.org.

Gavin McDonald was granted apmail karma.

The general availability of svn.eu.apache.org was announced on
committers@.

Upayavira volunteered to represent the infra team at an OSS Watch
workshop on profiling open source communities.

21 May 2008 [Justin Erenkrantz]

We purchased 3 Dell 2950s costing about $12K total.

We purchased a certificate for svn.eu.apache.org, intended for use as an
svn mirror of our main repos. We are in the final testing stages now, and
expect to bring this machine into community service in the very near
future.

Sun donated 8 SAS drives and we had them shipped to Sander Striker. The
drives will eventually be installed in odyne, the 4150 we cannibalized to
get harmonia up.

Sun has offered to provide us with a support contract for our machines in
SARA, at no charge to the ASF. Details have yet to be finalized.

Old eris has retired itself due to its failing RAID array, and one of the
2950s was pressed into service via trial by fire. A tremendous amount of
effort went into bringing the replacement box online and stabilizing it,
chiefly by Philip Gollucci, Norman Maurer, Paul Querna, and Tony Stevenson.

Work continues on setting up bia, our backup host, driven by Norman Maurer
and Tony Stevenson.

Roy T. Fielding made significant improvements to our ezmlm installation,
eliminating the need for moderation on our commit lists.

Sebastian Bazley has done an incredible amount of work rationalizing
various foundation records.

Roughly 3 dozen new accounts were created this month.

2 TLPs, archiva and cxf, have been migrated.

In light of the recent Debian/Ubuntu security advisory (CVE-2008-0166)
regarding ssh/ssl, we have upgraded the host keys on all of our Ubuntu
hosts, and have scanned all the public keys on people.apache.org. We
investigated the ssl certificate on brutus and found that it predates the
vulnerability.

4 committers' and 2 members' accounts had their public ssh keys disabled on
people.apache.org for failing to comply with a request from root@ to remove
them within 48 hours of being notified. The keys in question were all
detected by the dowkd.pl script, and most users (~30 total) who received
the notice dutifully complied.

Gaea has started shutting itself down occasionally for reasons unknown. We
are investigating, but so far there's been little information to go on in
the logfiles.

It was noted that a map of host names to services can be found at

http://monitoring.apache.org/status/

Concerns about host names are deferred to the infrastructure team.

16 Apr 2008 [Justin Erenkrantz]

Purchased a 1y silver-level Sun support contract from Technologent
regarding OSUOSL-located Sun equipment.

We had the system board and a DIMM in eos replaced under the aforementioned
support contract.

Work continues on setting up bia, our backup host, mainly driven by Tony
Stevenson and Norman Maurer.  Both of them have been granted apmail karma.

Roughly 2 dozen new account requests were processed.

Sun graciously donated a pair of x4150s for use at SARA.   We have brought
one of them online as harmonia, and will be pressing it into service as an
svn mirror site.

A confluence-backed website was hacked into due to an improper permissions
scheme.  As confluence typically does not provide change notifications, no
record of the event was sent to the affected project's mailing lists. It
was later determined by David Blevins that roughly 30 other confluence
spaces were also misconfigured, further reinforcing the opinion of many
infrastructure members that the confluence installation at the ASF is a
bridge to nowhere.

Wendy Smoak continues her excellent work reviewing changes to the Apache
maven repos on repository@.

Gavin McDonald, Norman Maurer, and Tony Stevenson were all granted full
infrastructure karma.

19 Mar 2008

Pressed new brutus machine into production, shut down old brutus.

Brought bia (the new backup machine) online.  Work continues on
setting up the actual backups, mainly driven by Tony Stevenson
and Norman Maurer.

Experiencing some problems with jira's availability on new
brutus.  Jeff Turner is looking into it.

Migrated bugzilla to 3.0.  Special thanks to Mark Thomas and
Sander Temme for all the hard work that went into that process.

TLP migration for continuum was completed.  Roughly 2 dozen
new account requests were processed.

infrastructure-dev@ was made a public list at Jukka Zitting's
request.

19 Dec 2007

Acquisition strategy through May 2009; bottom line figure is $58,900.

Key features:
 - Replace our aging IBM x345s which are currently the cornerstone of
   our infrastructure - they are nearing 4 years old.
 - Stagger replacements so as not to do it all in one go
 - Gets a SVN mirror in EU
 - Equip a respectable build farm
 - Equip for CMS 'thing-ma-bob' with staging + 2 prod servers

The specifics of the machines may change, but this is an overall plan.

Acquisition strategy:

- OSL: Stay relatively power-neutral for next 4-6 mos; expand after then
- SARA: Expand to 20U 'early next year' (pushing for Feb.)
- x345s: Acquired in late 2003 / hermes prod in Feb. 2004
- Helios: In-service approx. April 2005

Decision:
 Base configuration: stick with x2200M2 with SATA drives

---

Base equipment costs [as of 10/28/2007]:

Machine:
 - x2200 M2 - 1x 2210 / 2GB / No drives: $1619/ea (incl. tax+shipping)
 - x2200 M2 - 2x 2218 / 8GB / No drives: $3871/ea (incl. tax+shipping)
 - x4150 - 1x Intel E5320 (1.86GHz; Quad) / 2GB / No HDD: $3082/ea (incl. t&s)

Machine extras:
 - CPU: AMD Opteron 2210 (1.8GHz): $179/ea from Newegg
   CPU: AMD Opteron 2218 (2.6GHz): $455/ea from Newegg
 - RAM: DDR2 PC2-5300 / CL=5 / Registered / ECC / DDR2-667 / 1.8V
  2GB sticks from Crucial: $135/ea [Buy in pairs]
  8GB  (4x2GB) -> $600 (incl. tax) [$581.81]
  16GB (8x2GB) -> $1200
  32GB (16x2GB)-> $2400

Storage:
 - Hitachi A7K1000 750GB SATA drive: $230/ea from Newegg
 - Seagate ES.2 1TB SATA drive: $339.99/ea from Newegg
 - Factor $250 for 750GB; $350 for 1TB

Derived cost for manual upgrade of base x2200 config:
 - 2x 2210 / 10GB / 2x 750GB -> $2919/ea ($3000/ea)
 - 2x 2210 / 10GB / 2x 1TB -> $3119/ea ($3200/ea)

2nd-level x2200 config:
 - 2x 2218 / 8GB / 2x 750GB -> $4371/ea ($4400/ea)

3rd-level x2200 config:
 - 2x 2218 / 8GB / 2x 1TB -> $4571/ea ($4600/ea)
 - 2x 2218 / 16GB / 2x 1TB -> $5171/ea ($5200/ea)
 - 2x 2218 / 32GB / 2x 1TB -> $6371/ea ($6400/ea)
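
The 2218-based figures above appear to follow from the base equipment costs by simple addition. The sketch below is one reading of that arithmetic, not a record of how the numbers were produced: it assumes the $3871 base machine (2x 2218 / 8GB / no drives), the per-drive budgeting factors ($250 for 750GB, $350 for 1TB), and $600 per additional 8GB of RAM. The helper name `x2200_2218_cost` is hypothetical. The 2210-based and x4150 figures are not covered here.

```python
# Prices in USD, taken from the "Base equipment costs" list above.
BASE_2218_8GB = 3871                        # x2200 M2, 2x 2218 / 8GB / no drives
DRIVE_FACTOR = {"750GB": 250, "1TB": 350}   # budgeting factor per drive
RAM_PER_8GB = 600                           # 4x 2GB sticks, incl. tax


def x2200_2218_cost(ram_gb, drive, drives=2):
    """Hypothetical helper: cost of a 2x 2218 x2200 M2 configuration.

    Assumes RAM is added in 8GB steps on top of the 8GB base machine.
    """
    extra_ram_kits = (ram_gb - 8) // 8
    return (BASE_2218_8GB
            + extra_ram_kits * RAM_PER_8GB
            + drives * DRIVE_FACTOR[drive])


for ram_gb, drive in [(8, "750GB"), (8, "1TB"), (16, "1TB"), (32, "1TB")]:
    print(f"2x 2218 / {ram_gb}GB / 2x {drive}: ${x2200_2218_cost(ram_gb, drive)}")
```

Under these assumptions the four results match the listed per-machine prices of $4371, $4571, $5171 and $6371.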

x4150 config:
 - 1x E5320 / 4GB / 6x 750GB -> $4982/ea ($5000/ea)

---

Helios [Solaris zones]:

 13x 750GB SATA drives = $3250
 Battery backup replacement (Jan-Feb): $450
  http://www.memoryxsun.com/3705545bat.html [370-5545-BAT]

Conditional: Needs correct braces from Sun; ETA next week @ OSL.
Purchase: December
In-service: January
Helios array total: $3700

---

Brutus [Issues: JIRA/Bugzilla/Confluence]:

x2200 M2: 2x 2218s (4 cores) / 8GB RAM / 2x 1TB SATA / Linux x86_64
 [ Atlassian will support via official Sun x86_64 JVM ]
Purchase: December
In-service: January @ OSL
Expected price: $4600

---

SVN mirror @ SARA:

x4150: 1 Quad-Core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
Conditional on: x4150 being available with SATA drives
Purchase: Late Dec / Early January
In-service: February @ SARA
Expected price: $5000
 [SARA box needs to be purchased thru Sun .NL; so may be more if in EUR.]

--

Eris replacement [SVN @ OSL]:

x4150: 1 Quad-Core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
Purchase: March (after SVN mirror @ SARA setup)
In-service: April
Expected price: $5000

--

Loki [cold-spare @ OSL]:

x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
Purchase: May
In-service: June
Expected price: $3000

--

Hermes [mail @ OSL]:

x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
Purchase: July
In-service: August-September
Expected price: $3000

--

Build farm (@ OSL) - stage 1:

(2) x2200 M2: 2x 2210s (4 cores) / 16GB RAM / 2x 1TB SATA / Linux (VMWare Wks)
Purchase: September
In-service: October-November
Expected price: $11,400 (2 @ $5200)

--

Build farm (@ OSL) - stage 2:

Apple xServe: 2x Dual-Core Intel Xeon / 1GB RAM / 80GB SATA
Purchase: January '09
In-service: February '09
Expected price: $3200 ($2,999 + tax&shipping)

--

CMS thing-ma-bob's:

Staging @ OSL: x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Prod @ OSL:    x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Prod @ SARA:   x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
Purchase: February '09
In-service: March-May '09
Expected price: $20,000 (3x$6400)
 [SARA box needs to be purchased thru Sun .NL; so may be more if in EUR.]
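
As a cross-check, the bottom line figure of $58,900 can be reproduced by summing the "Expected price" (or array total) of each item in the plan. The short sketch below does only that; the dictionary keys are shorthand labels chosen for illustration. Stage 1 of the build farm is taken at its stated $11,400, even though its parenthetical (2 @ $5200) would come to $10,400; the stated figure is the one that reproduces the bottom line.

```python
# Expected prices in USD, as stated for each item in the plan above.
expected_prices = {
    "Helios array": 3700,
    "Brutus": 4600,
    "SVN mirror @ SARA": 5000,
    "Eris replacement": 5000,
    "Loki": 3000,
    "Hermes": 3000,
    "Build farm stage 1": 11400,   # as stated, not recomputed from 2 @ $5200
    "Build farm stage 2": 3200,
    "CMS servers": 20000,
}

total = sum(expected_prices.values())
print(f"Total through May 2009: ${total:,}")
```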

19 Jul 2006 [Sander Striker]

Approved by General Consent

27 Jun 2006 [Sander Striker]

Tabled due to time constraints.

15 Mar 2006 [Sander Striker]

Sander provided an overview of the current state of the ASF Infrastructure, as summarized in the President's report above.

Approved by General Consent.

21 Dec 2005 [Sander Striker]

No report received or submitted. Sander to be contacted regarding status.

21 Sep 2005 [Sander Striker]

Sander submitted an oral report, noting that Infrastructure was very busy. Leo Simons sent an Email to committers@ asking for volunteers to help.

Approved by General Consent.

22 Jun 2005

April-June

The infrastructure team has been so busy it hurts. We have migrated a few
more projects to top level, migrated a few from cvs to svn, added some new
infrastructure projects, users, "the usual". We are seeing roughly 600
emails a month on the infrastructure mailing list, excluding svn commit
messages.

Besides the usual, some things of note include that

Nontechnical
------------
* we have slowly gotten to work on the board's request to formulate RFPs
 on paid staff/outsourcing.

* OSU OSL has generously offered to provide us some hardware along with
 hosting at their colo.

* we have set up a new mailing list, site-dev@apache.org, which is tasked
 with figuring out a flexible and powerful new website publishing
 process.

* we have solicited volunteers which has yielded roughly half a dozen
 responses from previously silent people as well as somewhat inactive
 infrastructure participants offering to become more active.

* we have found we are not currently in optimal shape for productively
 growing our team and getting people things to do. Work is in progress
 to improve that.

* quite a bit of work has been done and is still in progress on internal and
 project-oriented documentation.

* we have some promising submissions to the google summer of code programme
 for helping with the asf-workflow tool.

* we are planning another infra-thon (infrastructure team get-together) in
 the weekend leading up to apachecon which should be considerably more
 modest and hence cheaper than the last one.

Technical
---------
* our mailserver has had a lot of trouble handling the load (mostly
 spam) lately. Solutions being explored include optimizing the
 machine's configuration, patching the mail software to be more efficient,
 taking into operation a new machine at OSU OSL, and generally anything
 else we can think of. It has taken a lot of time just to keep things
 running, and we have seen some service interruptions. We may need more
 hardware if the amount of spam that goes around the web continues to grow.

* brutus has been partially reinstalled to serve as another FreeBSD host,
 which hopefully will be finished this week. We will use it to take over
 some of the duties of minotaur (our main box) as we upgrade that.

* AMD has donated a machine for running apachecon.com which we will be
 hosting in our rack in the San Francisco colo. The machine is scheduled
 for install this week.

* loki (running vmware) has been partially configured for gump runs.

* we have another machine on loan at OSU OSL for gump runs which is not
 in operation yet.

* quite a few (dozen??) zones have been set up on the new Sun box,
 helios. Various PMCs have been busy setting a variety of services up.
 Helios is still in a testing phase.

* we are hoping to bring the RAID array that came with helios into operation
 this week.

* we experienced a lot of performance problems with the wiki which were
 solved by putting in place a development install of apache httpd with
 mod_cache.

* JIRA has been upgraded to the most recent release. It is still being
 tuned to be as stable as it was before. Users are seeing a notable
 performance increase and fancy new features such as automated links to
 subversion changes.

* Our SVN service had a few hiccups, but the incident rate seems to be decreasing
 whereas usage is still increasing. The majority of our projects are
 now using SVN, with more projects in the migration queue.

* the certificate service (ca.apache.org) has seen a lot of development
 work recently.

* we have moved DNS registrars. We are now with dotster??

* Serge's Nagios install has been moved to monitoring.apache.org and
 reconfigured to provide even more useful information.

* We have purchased and installed another PDU at our main colo in SF

30 Mar 2005 [Sander Striker]

The Infrastructure Team has gathered for an Infra-thon from
March 18th (Fri) til March 22nd (Tue).  A travel/lodging budget of
roughly $3500 was spent.  The final number is not available at this
time.

All machines based at Mission St. have been moved to Paul, the
new location of the UnitedLayer colo.  Three passes have been
requested for physical entry to the colo, for: Brian Behlendorf,
Scott Sanders and Sander Temme.

Some issues have arisen during the infrathon, but none have
caused severe downtime or unacceptable discomfort to our projects.

Due to an error on the part of the team we have now roughly figured
out how much bandwidth we would use without the use of mirrors;
roughly 70Mbps.

The power distributors bought as approved by the board have not
been used.  It turned out that we can actually use the 0U PDU
that we already have, and, there is room for a second one.
We'll have to return the unused PDUs; Mirapath is working on
a quote for a new 0U PDU.

We are planning on moving all shell accounts elsewhere
at some point.  This will be hosted under people.apache.org.
DNS has already been updated, meaning our committers have to use
people.apache.org to log into their shell.

Also, we managed to get hermes (our primary mailserver) failing
every 7-8 hours.  This problem seems to have disappeared after
a BIOS update and/or a kernel update.

Brutus, the current Gump machine, can remain under supervision of the
Gump PMC until we are done setting up the new Sun server, helios, and
loki, our machine purposed for running VMWare.  After that, brutus
will most likely become our secondary mailserver.  The plan was that it
would be shipped to The Netherlands, where it can do its work, and serve
as a fail-over in case of failure of hermes.  However, due to external
conditions that are applicable to the IBM machines we cannot ship any
of them out of the US at this point.

Minotaur, our machine usually serving our websites, subversion and
shell accounts was relieved of serving the websites prior to the colo
move.  We've left this to be the case because we wish to upgrade
the OS on that machine in the near future.  We are synchronizing
content to ajax every 4 hours.  We were planning to upgrade minotaur
to FreeBSD 5.3.  Unfortunately there were some setbacks that
prevented us from doing a backup.  We weren't feeling very lucky,
and have decided to put off the upgrade til a later point in time.

Ajax, our european hosted machine, is currently hosting most
of the websites (with the exception of tcl and perl), as well as
the wiki, jira and bugzilla.  We've switched off the indexing by
search engines for the wiki and the issue trackers since the load
on the machine was insanely high.
Ajax too is scheduled for an OS upgrade, due to the fact that it
is stuck in IO wait half the time.  We will however not be doing
this while most of our infrastructure is hosted by that machine.
It has held up pretty nicely, even when we at some point were not
using the mirrors and it was handling most of our traffic (we
peaked at 50Mbps).

Loki is going to be our machine hosting VMWare.  We've added 2GB
of memory and 2 36GB disks from one of the other machines, giving
it enough beef to actually run the ESX installation.  Its primary
purpose is going to be to host various OS instances for Gump to
run on.

Helios is our new Sun v40z.  It came with a 1TB StorEdge RAID array.
Unfortunately it was delivered without an OS installed, no install
media, and, the FibreChannel card to connect the StorEdge to the
machine was missing.  This will all be resolved, but for now we
are not able to swiftly put this box into production.  However,
when put in production this machine will host several different
so called zones.  One of these zones will be people.apache.org,
which will host all current shell accounts.  Every PMC will most
likely also get their own zone, which it can use as a testing
ground/showcase for their own software.  Finally Gump will be
given one or two zones for their runs.

Eris, our final machine, is stripped down to just the chassis
pending hardware to refit it.

The wiki has seen an upgrade, which was quite an undertaking.

Preparation was done for the Bugzilla upgrade.  The final
upgrade will be done at a later stage.

Work is progressing on the Eyebrowse to mod_mbox migration,
preserving the Eyebrowse urls.

Subversion is doing fairly well.  We've seen a number of repository
wedges, which seem to have nothing to do with the core
functionality of Subversion, but rather has to do with the add-on
functionality provided by ViewCVS.  We are aiming to cure the
symptoms, since we have not been able to pinpoint the cause, by
moving our Subversion backend from fs_bdb to fsfs.

All in all the infrathon has proven to be a valuable experiment.
For the future I personally would consider limiting infrathons to
software work only, deferring all work involving hardware to the
locals.  The high bandwidth communication as well as the near to
full time availability of the entire team has proven to be
incredibly useful.
That said, the focus for our services will have to shift at some
point to integration and coupling as currently we are growing more
and more islands that in consequence require a relatively large
support team.  Reducing complexity for our users as well as the
Infrastructure team is definitely something to consider.

Infrastructure Team report approved as submitted by general consent.

15 Dec 2004 [Infrastructure team]

We're transitioning from nagoya to ajax.  We've submitted the final H/W list to
Sam for IBM, and are waiting to hear back from him.  We're going to proceed
with the purchase of a UPS for our co-lo rack shortly.

We are also coordinating an infrastructure get-together in SF for Q1 2005 so
that we can address the pressing large-scale items on our plate with as many
people in the same room as possible and near our servers to coordinate server
hardware and software upgrades.  Financial assistance to help bring the
participants together is desired.

Apache Infrastructure report approved as submitted by general consent.

14 Nov 2004 [Sander]

 Infrastructure benefits a lot from the face to face meetings at
 ApacheCon.  Work is getting done.

 David Reid and Ben Laurie have been working on the CA and things
 are looking very promising.  We will be able to offload all the
 adding and removing people from groups.  This will require that
 all services are exposed via HTTP, so there is a lot of work that
 needs to be done.  With the moving of several projects to SVN,
 this goal will be easier to reach, given that shell accounts
 won't be needed anymore to do actual development.

 People do actually volunteer, but it is hard to actually parallelize
 a lot of the tasks given the centralized knowledge.

 Services are being moved around and off nagoya, since the machine
 is being retired.

 Infrastructure is budgeting $5k for the acquisition of a UPS
 for the US colo.

22 Sep 2004 [Sander]

 Sander reported that Infrastructure is finding itself
 steadily overworked, as well as there being confusion over
 how much authority Infrastructure has; he pointed to the
 mirroring policy as a prime example.  He was happy to
 report that additional volunteers have joined, especially
 Roy.

18 Aug 2004 [Sander]

 The Infrastructure team is battling the same problems as
 reported before.  Infrastructure needs help to get things done.
 The root rotation doesn't get filled.  An obvious way out
 is hiring a (part-time) sysadmin; so that the team can focus
 on getting automation tools developed making the job less
 involved.

 That said, Ken Coar offered help in writing mailing list management
 tools.  Also Geir offered to help out with the mailing lists.
 And we are happy to announce that Berin Lautenbach has joined the
 group of roots.

 Hardware wise we gained a switch contributed by Theo van Dinter.
 We are waiting for that to arrive.

 Ajax still doesn't have console access, nor do we have the
 accounts on the power switches to power cycle the box.  SurfNet
 also asked us to work out the reverse DNS on our end, which we
 haven't gotten around to yet.

 DNS is under the Infrastructure team's control again, and we
 are happy to announce that we added two more secondaries, making
 our DNS servers a bit more globally spread.

 A lot of work has been done getting VMware instances up and running
 and this seems to pay off.  Investigation into applicability is ongoing.

 Services hosted on nagoya will be moved off to other boxes, given
 the current (in)stability of some of the services on nagoya.

 eris, one of the new IBM x345s seems to be having some problems.
 Investigation is ongoing.

21 Jul 2004

 One issue - one of the new IBM boxes not functioning correctly,
 and budget is needed for switch and UPS, but still working through
 so no current action items.

21 Apr 2004 [Sander]

 Sander gave a verbal infrastructure report.  Recent events have
 included minotaur disk problems; new disks have been sourced.
 The new machines are now in the racks and are currently being
 tested.  There was some discussion with United Layers over
 power consumption, but this has been resolved without requiring
 any action.

21 May 2003

[ from Brian Behlendorf ]

Major efforts:

New ASF server was paid for by ASF check and picked up by Brian.  Brian to
 set up an in-person meeting for the local ops team to check out the box,
 learn how the RAID works, etc.  To be scheduled for some time next week
 most likely.

The colocation agreement with United Layer was signed.  I'll send a copy
 to Jim for our records.  We can move in any time, billing starts a few
 weeks after we move in our first box.  First move might happen this week
 if I get my act together (Brian speaking).


Other work done by the team:

Created 22 accounts: jesse, antoine, andreas, alexmcl, egli, gregor,
 michi, edith, felix, liyanage, memo, thorsten, minchau, sterling,
 psmith, sdeboy, brudav, michal, cchew, yoavs, funkman, joerg
Removed 1 account: mehran
Updated the account creation template in the infrastructure repository,
 actfrmtmpl.txt
Dealt with a couple out-of-disk-space situations
Dealt with some runaway robots
Discussed whether to change the commit-mailer to only send viewcvs URLs
 if the commit message would otherwise be too big.
Updated the Subversion installation on icarus
Fixed a content-encoding issue on the web server, and turned off SSI
Updated search.apache.org


Also of note:

A demo of SourceCast for apache.org has been set up, and I've shared
 access to it for a few folks.  Most of them are busy, though, so if
 others would like to give it a try, write me privately.  I'll be putting
 together a true plan for ASF evaluation over time, this is just an
 initial kick-the-tires kind of thing.

30 Oct 2002

Establish an infrastructure board committee

    The following resolution was proposed:

    WHEREAS, the Board of Directors deems it to be in the best
    interests of the Foundation and consistent with the
    Foundation's purpose to establish an ASF Board Committee
    charged with maintaining the general computing
    infrastructure of the ASF.

    NOW, THEREFORE, BE IT RESOLVED, that an ASF Board Committee,
    to be known as the "Apache Infrastructure Team", be and
    hereby is established pursuant to Bylaws of the Foundation;
    and be it further

    RESOLVED, that the Apache Infrastructure Team be and hereby
    is responsible for creating and upholding the computing
    policy for the Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is charged
    with managing and maintaining the infrastructure resources
    of the Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is charged
    with accepting infrastructure resource donations to the
    Foundation; and be it further

    RESOLVED, that the Apache Infrastructure Team is responsible
    for handling communication and coordination in relation to
    infrastructural issues; and be it further

    RESOLVED, that the persons listed immediately below be and
    hereby are appointed to serve as the initial members of the
    Apache Infrastructure Team:

      Brian Behlendorf (chair)
      Justin Erenkrantz
      Pier Paolo Fumagalli
      Ask Bjoern Hansen
      Aram Mirzadeh
      Steven Noels
      David Reid
      Sander Striker

 Discussion on this resolution focused around the need for such
 a Board Committee. Roy Fielding noted that such a committee
 might be best handled as a President's Committee since the
 President, rather than the Board, is in charge of operational
 aspects of the ASF. It was further discussed that such a team
 would be a good idea to create a focal point for long term
 initiatives, as a contact point, and to create a sense of
 empowerment for the people interested in the technical
 infrastructure of the ASF.

 By general consent, this resolution was tabled, with a
 recommendation to the President to establish a President's
 Committee with the same goals and responsibilities.