Formal board meeting minutes from 2010 through present. Please Note: The board typically approves minutes from one meeting during the next board meeting, so minutes will be published roughly one month later than the scheduled date. Other corporate records are published, as is an alternate categorized view of all board meeting minutes.
Recent Issues
=============
Nothing is needed from the President or the Board. These are reported as an FYI only. We have seen issues regarding SHA-1 vulnerabilities, supporting the Apache Maven project, and changes in our build systems. Details are provided below, in the "General Activity" section.

Finances
========
We spent $525 to purchase several months of online training. This is an unbudgeted amount, but we believe the unlimited coursework it provides for the entire team, for three months, was worth the experiment. At the end of the period, we will evaluate whether an extension is warranted. Costs for ongoing staff education will be included in our next budget request.

Short Term Priorities
=====================
- Decommission ASF-owned hardware at an accelerated rate, in favor of cloud-provided servers
- Continue ramping up the Gitbox service
- Balance our datacenter usage for cost efficiency
- Gitbox/Jira integration

Long Range Priorities
=====================
- As reported before: continued migrations of our legacy servers and services into new puppet-based services that we can efficiently deploy to cost-effective cloud providers.
- Automation to reduce the incremental cost of regular Infra tasks
- Migration from Puppet V3 to $nextgen system for providing services

General Activity
================
We set up a new web area for the Directors to create an authoritative set of pages for Board-approved policies and commentary. Some initial work by Directors has populated some data/pages, but the site design and content is still in its infancy. The plumbing appears to work, so we're "done" and will follow with continued support.

There has been a lot of Internet discussion about Google and CWI finding and publishing a SHA-1 collision, and their statement that it is now possible to construct additional collisions. They will be releasing further data in a few months.
From our initial analysis, this issue only affects our Subversion services as a limited denial of service, instituted by an Apache committer (NOT by a third-party attacker). The Apache Subversion community has been discussing and analyzing the issue, including the extent of the problem and appropriate mitigations. We have already deployed a script developed by their community, to prevent a committer (or a compromised account) from pushing either of the collision documents into our repository. We have confirmed that our website certificates DO NOT use the SHA-1 algorithm. This has been in place for quite a while.

This past month, we discovered that we cannot support the growth of Apache Maven's private copy of the Maven Central repository. We previously offered the PMC a VM to keep a copy (should Maven Central go dark, we'd retain all necessary data for the ecosystem), along with CPU to perform analysis against that copy, but looking at the storage growth rate, we determined that this offering was not sustainable within the current Infrastructure budget. We notified the Apache Maven PMC that we needed to retract the offering, and asked them to seek specific budget support from the Board for their needs.

Over the past year, the Infrastructure Team has moved to a policy of "Ubuntu Only" for our machines, to lower our costs. In the past, we had a lot of time to support multiple operating systems, services, and customized software deployments. With the rapid growth of the ASF, and the resulting demand for Infra support, we have pulled back on the edge cases to focus more strongly on the ROI of our staff's work effort. That has resulted in the Ubuntu policy, which then resulted in the decommissioning of the Solaris, Mac OS, and FreeBSD build slaves in our buildbot service. Needless to say, that has raised concern within several projects who relied on the availability of those platforms.
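For illustration, here is a minimal sketch of the kind of pre-commit check involved. This is NOT the actual script deployed by the Subversion community; the hook structure and `svnlook` usage are assumptions. The digest shown is the published "SHAttered" collision digest, which both released PDFs share.

```python
#!/usr/bin/env python
"""Hypothetical sketch of a Subversion pre-commit hook that rejects
known SHA-1 collision documents. The real community script differs."""
import hashlib
import subprocess
import sys

# SHA-1 digest shared by both published "SHAttered" collision PDFs.
KNOWN_COLLISION_DIGESTS = {"38762cf7f55934b34d179ae6a4c80cadccbb7f0a"}

def is_blocked(content: bytes) -> bool:
    """Return True if the file content matches a known collision digest."""
    return hashlib.sha1(content).hexdigest() in KNOWN_COLLISION_DIGESTS

def check_txn(repo: str, txn: str) -> int:
    """Inspect every file changed in the pending transaction via svnlook."""
    changed = subprocess.check_output(
        ["svnlook", "changed", "-t", txn, repo], text=True)
    for line in changed.splitlines():
        action, path = line[:1], line[4:]
        if action in ("A", "U"):
            content = subprocess.check_output(
                ["svnlook", "cat", "-t", txn, repo, path])
            if is_blocked(content):
                sys.stderr.write("Commit rejected: %s is a known "
                                 "SHA-1 collision document\n" % path)
                return 1  # non-zero exit aborts the commit
    return 0
```

Because the two collision PDFs hash to the same digest, a single blocklist entry covers both documents.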
We have made it clear that the Infra Team will integrate third-party custom build slaves into our system, so that projects can use those slaves for their non-Ubuntu builds.

Uptime Statistics
=================
Overall uptime reached 'three nines' with 99.9% uptime. The 'worst offenders' this period were writeable git repositories (due to a TLS bug) and Jenkins, though none of the critical or core services went below 99%. For more information, please see http://status.apache.org/sla/

Community Growth
================
This period, we've had one new contributor to our puppet repository, Jitendra Pandey, as well as contributions from 9 people who contribute on a regular basis. On the JIRA side of things, we had 29 new people interacting with Infra via JIRA, alongside 93 regulars (people that have contributed before and do so often) and 5 'returnees' (people that have been absent for >2 years but are now contributing/reporting again).

On a 3 month view, we've now had code contributions from 22 people, of which 13 were regular contributors and 9 were new. 94 people have worked on or reported a JIRA ticket for the first time, while the other 187 who worked on or reported issues had done so before.

GitHub as Master ("Gitbox")
===========================
Our Git services are planned to land on gitbox.a.o, so we generally refer to this as "gitbox". In the past month, we started moving the OpenWhisk podling over to the gitbox service. That has been a very slow move, but Tika and Nutch have recently been added to gitbox. It is too soon to remark on problems and SLA for these communities using GitHub as their master/primary focal point of development. These communities have enough activity to help us surface and pinpoint problems. We've made several improvements based on feedback, and still need to implement some Jira integrations. We will likely add more projects before the next Board meeting, and will report on such additions.
Cost-per-Project Reduction
==========================
An important, needed clarification arose this past month, regarding the definition of this effort. Reducing the cost-per-project is about managing the marginal/incremental cost each time a project is introduced to the ASF infrastructure. This effort is not about the *overall* Infrastructure bottom line (e.g. staffing, training, travel costs) as those costs have *very* tenuous connections to the incremental cost of a new project.

There is certainly a mild connection to hardware/hosting costs, as we offer VM services to projects. Those VMs create a very real cost to the ASF, and we are developing a way to track and allocate those costs. The more direct costs appear to be related to the work that Infrastructure performs when a podling is accepted, and when a podling graduates. These events create a lot of work around managing mailing lists, repositories, Jira, wikis, etc. These are the incremental costs that we hope to reduce, through automation, once we are done with the higher-priority work of VM migrations.
Finances & Operations
=====================
We've been working with Virtual to implement a process improvement that results in better privacy protections for HR-related data for our contractors and employees.

Short Term Priorities
=====================
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our supported services (e.g. Sonar and Roller)

Long Range Priorities
=====================
- Move all services off ASF-owned hardware, including the difficult process of moving our email infrastructure
- Finish the use of puppet for all services, then explore moving towards Puppet 4 and/or containers for ongoing service management

General Activity
================
- Tightened/streamlined git sync processes (see paragraph below)
- Ongoing conversation and development plans for integrating podling management into LDAP (and other ASF tooling; particularly, gitbox)
- Finalizing launching Fisheye services locally at the ASF to replace the third-party service run by Atlassian.
- Restructured the git repository request service (reporeq.apache.org) to better handle podlings, in particular assisting them in setting the right name and notification lists.
- We suffered a catastrophic hardware failure on the physical machine that was hosting the application side of Jira. The service was fully puppetized and relocated to a VPS. More details will be published after a post-mortem, during the week of the 16th.

Uptime Statistics
=================
Uptime for this month has been around 99.6% overall. Some issues with git-wip running out of memory at times have pushed its uptime down to around 98%. We are investigating the issue. blogs.apache.org moved to a new host, which also caused a small amount of downtime. Jira going down hard has not helped.
For more details, please visit: http://status.apache.org/sla/

Github as Master
================
The GitBox project is pending responses from the pilot projects before it can continue. We are looking at multiple potential candidates at the moment for this. Rather than wait on external groups, we will be using Infrastructure's own website repository for our testing.

Git mirror/GitHub sync process
==============================
The sync process from Subversion and writeable git repositories to git.apache.org, and onwards to GitHub, has been suffering from missed syncs lately, at an approximate rate of 1 miss out of every 10 syncs. The process has been improved and the logging widened, so we can better analyze any failures that may occur. The upgrades appear to have cut down on the missed syncs by a large factor. We have, as of this writing, had 2 missed syncs compared to the 100 or so we usually have, and both seem to be attributable to timeouts pushing to GitHub. The error rate going from git-wip to the pubsub system has been reduced from around 10-20 errors per day to 0 by refactoring the pubsub agent. We continue to monitor the situation and address any bugs that may show up.

Community Growth 2016
=====================
Since this is a new year, it might be worth looking back at 2016:
- We had 34 people contributing to our codebase (puppet) for the first time in their involvement with the ASF, compared to 12 people who regularly contribute to the repo.
- 354 new people have filed an issue with infra for the first time, while 329 people, who regularly work on issues, have also been contributing to the 2,321 issues created in 2016.
- 9 people who were previously active on JIRA have now started contributing code (patches etc) to Infra.

'Costs per project' Project
===========================
Unfortunately, I've been a bit swamped with other things and this hasn't gotten many cycles. Expect more on this issue next month.
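The widened logging and retry handling around the push step can be illustrated with a hypothetical sketch. This is not Infra's actual sync tooling; the function name, remote name, and timeout values are invented, and failures here are retried rather than silently dropped, so every miss leaves a log trail.

```python
"""Hypothetical retry-and-log wrapper around a git -> GitHub push step.
Not Infra's actual sync tooling; names and values are illustrative."""
import logging
import subprocess
import time

log = logging.getLogger("gitsync")

def push_with_retry(repo_dir, remote="github", attempts=3, delay=30,
                    runner=subprocess.run):
    """Mirror-push to the remote, retrying on timeout or failure,
    logging every outcome so missed syncs can be analyzed later."""
    for attempt in range(1, attempts + 1):
        try:
            runner(["git", "push", "--mirror", remote],
                   cwd=repo_dir, check=True, timeout=300)
            log.info("synced %s on attempt %d", repo_dir, attempt)
            return True
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as exc:
            log.warning("sync of %s failed (attempt %d/%d): %s",
                        repo_dir, attempt, attempts, exc)
            time.sleep(delay)
    log.error("giving up on %s after %d attempts", repo_dir, attempts)
    return False
```

The injectable `runner` keeps the wrapper testable without a real git remote; a timeout pushing to GitHub simply becomes one logged, retried attempt instead of a silent miss.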
Finances
========
This past month has seen lots of activity in reviewing the FY17 "actuals" and projecting our overall FY17 expenditures, within the Infrastructure section of the budget. These numbers have been coordinated with the President and with Virtual, and are presented elsewhere in the December Board agenda. In short, Infrastructure is forecasting increases in staffing costs, cloud services, and a small amount related to a once-a-year gathering of the team at ApacheCon. On the other side, we're looking at lower hardware costs as we transition from ASF-owned machines towards a more flexible posture using virtual private servers in our cloud providers.

Short Term Priorities
=====================
- Get one or more projects launched on the Gitbox system
- Training of our new staffers, particularly towards VM migration
- LDAP changes to support podlings, and to integrate it within our supported services (e.g. Sonar and Roller)

Long Range Priorities
=====================
- Move all services off ASF-owned hardware, including the difficult process of moving our email infrastructure
- Finish the use of puppet for all services, then explore moving towards Puppet 4 and/or containers for ongoing service management

General Activity
================
- Finalizing the (emergency) move of the moin wiki
- Continued gitbox work (see separate section)
- Ongoing conversation and development plans for integrating podling management into LDAP (and other ASF tooling; particularly, gitbox)
- SonarQube puppetized, LDAP-enabled, and brought up. This was a big VM move, and will enable self-service of sonar jobs
- Continued work on puppetizing Roller (blogs.a.o) and moving that VM. Testing is now beginning.
- Continued puppet work and VM moves
- Beginning investigation of launching Fisheye services locally at the ASF to replace the third-party service run by Atlassian. Numerous projects use the service, but it will be shut down in early January.
We're costing out a local replacement.
- The security-vm is in testing, with a new Jira instance. This Jira will be used (privately) for the Security Team and Brand Mgmt.

Uptime Statistics
=================
Uptime for this reporting cycle hit 99.9% overall, with critical services staying at an impressive 100.0%. The usual culprits, Jenkins and SonarQube, were responsible for dragging down the overall score, and are being replaced/updated to address this. The moin wiki, which has been moved and cleaned up heavily, has improved immensely, going from a previous average of 94% uptime to a solid 100% uptime over the past few months. This weekend (Dec 17-18) we were hit by an outage at NERO, causing outages for the services hosted at OSUOSL. While the situation has been resolved, this will reflect negatively on next month's SLA. For more details, please visit: http://status.apache.org/sla/

Github as Master
================
The GitBox project is moving ahead as planned, and is generally considered ready for testing with willing projects. We are in talks with a specific project for the initial tests, and will discuss onboarding more projects as this progresses. The services involved have been set up and fully puppetized, and tests are showing good results here. While this service depends on LDAP (and thus awaits the upcoming LDAP changes for podlings), we have modified the system to work with a hardcoded list of podling members, so we can test with podlings without having to wait for the LDAP changes. We have a list of things either to be done or that have been done, at: https://pad.apache.org/p/gitbox - this outlines what we were thinking, where we are, and what remains to be done before we can consider this service production-ready. We invite everyone to visit https://gitbox.apache.org/ and have a look, perhaps even try out the account linking service and provide feedback on this.
'Costs per project' Project
===========================
As a long-term project, Infrastructure has been tasked with reducing the "cost per project added into our systems." There has been some work at the margins of our self-service tools and processes, to reduce the staff/volunteer time to provide service and to reduce the resource costs of these services (e.g. locating services on more cost-effective providers). However, we have not made any progress on computing current per-project costs, projecting those costs, or planning mitigation strategies.
Additions to Infrastructure
===========================
Sebastian Bazley (sebb, apmail karma)
Freddy Barboza (fbarboza)
Chris Thistlethwaite (christ)
Christofer Dutz (cdutz)

In addition to the above karma grants, our two new staff members are ramping up (Freddy Barboza and Chris Thistlethwaite). Expect forthcoming blog posts from our three recent hires on the infra blog.

Operations Action Items
=======================

Short Term Priorities
=====================
- Ensuring we have gathered enough information to begin the Gitbox experiments.
- Making sure the new wiki instance is running smoothly
- Introducing new staffers to the infrastructure
- Networking/F2F meetings at ApacheCon Europe
- Further work on rebranding our new web site

Long Range Priorities
=====================
- Stand up a service for mirroring repositories and events from GitHub
- Mailing list system switchover is expected to happen in the coming months. We are aware of a few outstanding mail-search requests that we have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommissioning
- Deprecate eos (currently only running mail-archives; wiki was moved)
- Work on weaving podlings into LDAP
- Further explore identity management proposals

General Activity
================
- moin wiki (wiki.apache.org) was moved to a new, faster box and inactive accounts were pruned to greatly increase responsiveness.
- More package consolidation and updates on the Jenkins platform
- Fixed issues with the buildbot configuration scanner not working as intended.
- Stood up buildbot and jenkins setups for publishing web sites via git.
- Moved more VMs from vCenter to new cloud locations.
- gitbox (stage two of our github experiment) has been stood up as an actual machine, with some services working already. We expect to be able to start mirroring and gathering push logs in the coming month.
- In conjunction with gitbox, reworked the overall design of our writeable git repository web interface (and received positive feedback).
- Reworked policy for new git repositories so that new repositories are automatically mirrored and have github integrations enabled by default.
- Worked on a new web site for infrastructure. Some components can be used for projects wishing to switch to a git-based automated workflow.
- Added and fixed a bunch of jenkins build nodes.
- Debugged and (hopefully) fixed some issues with our pubsub systems
- Consolidation, general sanity checks and harmonization of Jenkins builds.

Ticket Response and Resolution Targets
======================================
Stats for the current reporting cycle can be found at https://status.apache.org/sla/jira/?cycle=2016-11 The tentative goal of having 90% of all tickets fully resolved in time is still being used. We are hoping that the onboarding of new staff will greatly improve these stats.

Uptime Statistics
=================
Uptime this month was mostly affected by the moin wiki, which we have now moved. We expect next month to have a much better uptime than this month (99.79%). For detailed statistics, see http://status.apache.org/sla/

Github as Master
================
Sam asked us to report on this process. To date, we've finished our discussion with Legal Affairs to ascertain the framework that things have to operate in. Additionally, we've started discussions on private@ about the ASF side of that work. You can see the documentation from that discussion: https://pad.apache.org/p/gitbox We have an instance running in AWS (the "gitbox") to run the above. Finally, we've identified an initial project to subject to the experiment.

'Costs per project' Project
===========================
Sam has asked us to focus on ways to reduce the straight linear expansion of costs for each additional project that we add.
To date, this remains a bit of a thought experiment on how to measure the costs in a more granular way than we do now. Currently, the data that we have shows a correlation between project growth and increases in requests for service, bandwidth, etc. This correlation has led to our existing planning/staffing models, but there remains much to be done on this front. The goal for next month is to figure out what specific points we need to be measuring, and ideally to look backwards to validate what the actual rate of consumption per project is. The goal for Q1 is to break that down, and figure out what the constraints of the current onboarding and ongoing operations are.
Operations Action Items:
========================

Short Term Priorities:
======================
- Work on experimenting with ProxMox for virtualization to aid in moving VMs from the vCenter cluster, slated to be decommissioned.
- Further explore upgrading puppetised machines to Ubuntu 16.04
- Engage with and onboard new staff once hired.

Long Range Priorities:
======================
- Explore moving podlings to separate LDAP entries, which would also make the MATT/GitHub experiment much easier in the long run.
- Mailing list system switchover is expected to happen in the coming months. A couple of outstanding tickets are pending, and we have yet to design a working redirect, although this is expected to be trivial.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommissioning.
- Moving JIRA to a new location, as the old machine is slated to be decommissioned.

General Activity:
=================
- Infrastructure will be present at ApacheCon Seville to interact with the wider ASF community.
- Maven backup repository is being moved to a different DC to save cost
- LDAP clusters are being resized to accommodate DC IP shortages
- VPN scenarios being worked on to reduce the number of public IPs used

Ticket Response and Resolution Targets:
=======================================
Stats for the current reporting cycle can be found at https://status.apache.org/sla/jira/?cycle=2016-10 The tentative goal of having 90% of all tickets fully resolved in time is currently not being met, but we expect the rate to go up once new staff have been onboarded.

Uptime Statistics:
==================
Uptime stats have been very stable over the past few months, at 99.9% for critical services and 99.8% in total. For detailed statistics, see http://status.apache.org/sla/
A report was expected, but not received
Short (and tardy) report this month.

Hiring
---------
We saw the blog post announcing the open position get reposted to several remote-working job boards and have had a decent response in terms of resumes showing up. We have made a first pass over approximately 1/3 of the incoming resumes as of this writing and hope to finish that first pass by end of week. We've seen a number of promising resumes and hope to be able to move forward with them. Hopefully we'll have material updates to provide in the coming week.

General activity
-------------------
Ongoing work on Jenkins and Buildbot slaves has finally coalesced into being completely managed via puppet. This is a huge milestone, both because these are relatively complex configurations and because of the number of machines involved. It's also gotten some critical acclaim. This should make it much easier for folks to get additional build dependencies installed in a more self-service manner (a simple pull request against the repo, compared to filing a ticket and having that done by infra).

Work is in preparation for migrating our Jira instance off of our existing hardware. The current hardware is approximately 6 years old, and our instance has grown so much that the underlying database infrastructure on separate machines is starting to become a constraint.

Building on blocky (our service for infrastructure-wide IP bans for violation of rules against one or many hosts), we've added a dashboard to manage both the rules and the blocked IPs.

Github experiment
-----------------------
Nothing material to report. Things seem to be largely 'just working'. We are working on adding an incubator project to the github experiment - we expect this will push the limits of the project as there are literally thousands of incubator committers, so it should be an interesting threshold to pass and gauge our ability to cover them. https://s.apache.org/A9oF
As reported last month, we began, and successfully completed, new contract negotiations with the contractors this month. We are still suffering a staff shortage, and that continues to deleteriously affect many things. Infrastructure has seen some notable problems. An intermittent Jira outage affected a large number of users. The folks at Atlassian assisted us in diagnosing the problem and moving forward. Additionally, one of our VMware hosts has an ailing storage array. While work was underway to evacuate all of the hosts, this storage malaise has exacerbated the situation.
Operations Action Items:
========================

Hiring
--------
We've published a job description and are working with the President and Virtual to publish this widely in an effort to solicit more candidates.

Demand for Infra services
-----------------------------------
The number of projects that the Foundation is responsible for continues to grow, and that is placing an ever-increasing burden on demands for infrastructure resources. Today, our largest constraint is staff members to do the work. Historically, we've had an average of 33 tickets per month per full-time staff member, and as that average grows we typically add staff. Today we have 3.5 full-time staff members working, though statistically we really should be at 5 just to be able to handle the ticket load. (This does not include any time to focus on larger-scale projects.) Given our current rate of growth, addressing tickets alone will require 5.5 full-time staff members by year's end, and if we continue at our current pace we'll need to add 1 to 1.5 staff members every 2 years just to deal with continued growth. Take a look at a graph demonstrating the growth rates, demand for services based on tickets, and staff members: https://i.imgur.com/72V0DFN.png

Short Term Priorities:
======================

TLP
-----
Some of the automation behind the mechanics of transforming a graduated podling into a TLP fell into disrepair over the past few months. This led to greatly extended timelines for TLP graduation. Infra held a TLP work day, and while it largely remains a manual operation, there is now a current runbook for dealing with graduations, and all of the pending graduations were processed. There is ongoing work to automate large swaths of that, but for now it should only take ~30 minutes to process a newly-graduated TLP.

Long Range Priorities:
======================
Because of the staffing shortage, precious little work has occurred on our long-range priorities.
Instead we've moved back to what can largely be described as firefighting and attempting to keep up with incoming work as best as can be managed.

General Activity:
=================

Outages:
------------
We suffered a somewhat long-term outage of the Nexus repository. We temporarily restored the service, but a service move off of the ailing VMware infrastructure is planned for the short-term future. More recently, we suffered both a database and a VMware-related outage on the same day. Our VMware infrastructure is on increasingly brittle hardware. We have been making concerted efforts for some months to move VMs off of this infrastructure, and continue to do so.

New Contracts
--------------------
Following up on discussions that occurred at ApacheCon NA, we are beginning the process of renegotiating contracts for our non-employee staff members.

Uptime Statistics:
==================
Overall we met the service uptime, though on an individual service basis we did not meet the uptime expectations for repository.a.o. Please see: http://status.apache.org/sla/
Operations Action Items:
========================

Staffing
-------------
Our staff is down by roughly 36% currently. That has impacted SLAs, particularly those around response times to tickets. Staff members have been interviewing a potential new hire.

Short Term Priorities:
======================

Code signing
-------------------
After initial plans to discontinue this service due to high cost, we were able to successfully negotiate a satisfactory contract with Symantec.

Jira Spam
-------------------------
Our Jira service was again under attack this month, starting sometime on Tuesday morning and continuing through to Thursday lunchtime. We have had to yet again restrict the 'jira-users' group from creating and commenting on issues. Regular contributors/committers to projects are still able to create and comment on tickets for projects in which they are named in 'roles'. Most committers cannot yet create INFRA tickets or tickets for other projects in which they are not in a role. The Infra team have banned over 60 IP addresses via fail2ban triggers. Nearly 2,000 spam tickets were created across multiple projects. Over 160 user accounts were either deleted or disabled. We have in place automatic ban triggers for more than one account created using the same IP address within an hour. At this moment the restrictions are still in place whilst Infra works out future options. Modules have been created for Crowd testing (in progress).

MATT/Github
-------------------
Following a successful (quiet) month of having both Whimsy and Traffic Server using the git-dual system, Infrastructure is contemplating adding more projects to the experiment, in part to test out different aspects of the experiment that may not be fully utilized by the current projects, and in part to increase the load on the service to see if anything breaks when we start hitting rate limits.
Infra is at present considering adding the Beam incubator project to the test, which would let us experiment with extremely large groups of people (universal commit bit etc).

lists.apache.org
----------------------
At ApacheCon we unveiled the new service https://lists.apache.org - this service is built to replace both mail-archives.apache.org and mail-search.apache.org. See more details below.

Long Range Priorities:
======================

Monitoring
---------------
No material updates here due to staffing issues.

Automation
-----------------
No material updates here due to staffing issues.

Technical Debt
-----------------------
We've taken the first major step in retiring mail-search and mail-archives. The former is particularly important in our paydown of technical debt, as it currently runs on hardware that is 8 years old; retiring it also reduces the number of operating systems that we are forced to manage. To boot, much of the mail-search platform is undocumented, and the current Infra staff have little knowledge of how to operate those services.

General Activity:
=================
Staffing issues have made this month a bit more mechanical than normal.

Uptime Statistics:
==================
For detailed statistics, see http://status.apache.org/sla/
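The automatic ban trigger described under 'Jira Spam' above (more than one account created from the same IP within an hour) amounts to a sliding-window counter. Here is a hypothetical sketch; the production setup uses fail2ban, and all names below are invented:

```python
"""Illustrative sliding-window signup monitor; NOT the actual
fail2ban configuration used by Infra."""
import time
from collections import defaultdict

WINDOW = 3600      # one hour, in seconds
MAX_SIGNUPS = 1    # more than one signup per window triggers a ban

class SignupMonitor:
    def __init__(self):
        self.signups = defaultdict(list)  # ip -> signup timestamps

    def record(self, ip, now=None):
        """Record a signup; return True if the IP should be banned."""
        now = time.time() if now is None else now
        self.signups[ip].append(now)
        # Keep only events inside the sliding one-hour window.
        self.signups[ip] = [t for t in self.signups[ip] if now - t < WINDOW]
        return len(self.signups[ip]) > MAX_SIGNUPS
```

In the real deployment, the equivalent logic lives in a fail2ban filter watching the account-creation log, with the ban action applied at the firewall.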
Short Term Priorities:
======================
- Ensuring/monitoring that the dual git repository setup works as intended.
- Provision an isolated test environment (including isolated LDAP) for developing/testing services faster than is currently possible, under a separate domain (asfplayground.org). We believe this separation will also help projects and volunteers work on their ideas, such as the Syncope trial, as they can then develop and test without interfering with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the troubled BugZilla instances and possibly JIRA.

Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming months. We are aware of a few outstanding mail-search requests that we have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommissioning
- Further explore identity management proposals

General Activity:
=================
- git-dual has been fully puppetized and is ready for more extensive testing.
- Traffic Server has joined the git-dual experiment, so far without incident.
- Due to severe abuse, we had to take our European archive server offline and place extensive restrictions on our US archive. We also had to put the US archive in maintenance mode for a few days in order to relocate it to a new machine that can better handle the load (and isn't 5 years old).
- Following concerns raised on infrastructure@ about the loss of history due to the migration of people.a.o to a new host not including the contents of committers' public_html directories, an infrastructure volunteer stepped forward to automate the copying of this data. Of the original ~500GB, around 80% was inappropriate (RCs, Maven repos, nightly builds) and was filtered out. There have been no further concerns raised since the copy of the remaining ~110GB was completed.
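The filtering of inappropriate public_html content described above can be sketched as simple pattern exclusion. The patterns below are illustrative guesses based on the categories named in the report (RCs, Maven repos, nightly builds), not the volunteer's actual list:

```python
"""Illustrative path filter for a public_html copy job; the exclusion
patterns are hypothetical, matching only the categories described."""
import fnmatch

# Hypothetical exclusion list: release candidates, Maven repos,
# nightly builds. The real job's patterns are not documented here.
EXCLUDE_PATTERNS = ["*-RC*", "*/maven/*", "*/.m2/*", "*nightly*", "*-SNAPSHOT*"]

def should_copy(path: str) -> bool:
    """Return True unless the path matches an exclusion pattern."""
    return not any(fnmatch.fnmatch(path, pat) for pat in EXCLUDE_PATTERNS)
```

Filtering this way is what reduced the copy from ~500GB to the ~110GB of genuinely personal content that was preserved.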
Ticket Response and Resolution Targets:
=======================================
Stats for the current reporting cycle can be found at https://status.apache.org/sla/jira/?cycle=2016-04

The tentative goal of having 90% of all tickets fully resolved in time is still being used. Compared to March, we have had slightly more tickets opened, and the percentage of tickets that hit our SLA is lower, as we are essentially two full-time staffers fewer than we were in March.

Quick April reporting cycle summary:
- 230 tickets opened
- 228 tickets resolved
- 248 tickets applied towards our SLA
- Average time till first response: 13 hours (up from 6h in Feb-Mar)
- Average time till resolution: 44 hours (up from 23h in Feb-Mar)
- Tickets fully handled in time: 90% (204/227)

Uptime Statistics:
==================
Uptime is not pretty this month, but this is primarily a cosmetic issue. The biggest failure here has been keeping LDAP servers in sync: a specific LDAP server in PNAP keeps drifting 10 minutes out of sync with the rest. We are looking into why this happens, but so far we have not been able to determine the true cause. Nonetheless, this does not mean LDAP has been down or unresponsive, merely out of sync on one node.

As mentioned earlier, we also had to pull the EU archive due to massive abuse, primarily from some EC2 instances and other external datacenters. We are working with the data centers in question to resolve the issue, and have also imposed a new 5GB daily download limit per IP on the new archive machine. Preliminary data suggests this new limit has reduced the traffic from archive.apache.org by as much as 65% (going from around 3TB/day to 1.1TB/day), further suggesting that the bulk of traffic from that service goes to poorly configured VMs or other CI systems that should never have been using the archive in the first place.
As the machine that hosted the archive also hosts the moin wiki and mail archives, we believe that these services will now perform better as a result of archives.a.o moving off it.

For detailed statistics, see http://status.apache.org/sla/
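The per-IP daily download cap described above lends itself to a very simple quota tracker. The sketch below is illustrative only - it assumes a calendar-day reset and GiB units, and is not Infra's actual implementation:

```python
from collections import defaultdict

DAILY_LIMIT_BYTES = 5 * 1024 ** 3  # the "5GB" cap, assumed to mean 5 GiB here

class DailyQuota:
    """Track bytes served per IP, resetting all counters when the day changes."""
    def __init__(self, limit=DAILY_LIMIT_BYTES):
        self.limit = limit
        self.day = None
        self.used = defaultdict(int)

    def allow(self, ip, size, day):
        """Return True if serving `size` bytes to `ip` on `day` stays in quota."""
        if day != self.day:        # new day: reset all counters
            self.day = day
            self.used.clear()
        if self.used[ip] + size > self.limit:
            return False           # over quota; the real service would deny the download
        self.used[ip] += size
        return True
```

A real deployment would enforce this at the HTTP front end, but the point stands: a single counter per IP, reset daily, is enough to cut off bulk scrapers while leaving ordinary users untouched.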
New Karma:
==========

Finances:
==========

Operations Action Items:
========================

Short Term Priorities:
======================
- Ensuring/monitoring that the dual git repository setup works as intended.
- Provision an isolated test environment (including isolated LDAP) for developing/testing services faster than is currently possible, under a separate domain (asfplayground.org). We believe this separation will also help projects and volunteers work on their ideas, such as the Syncope trial, as they can then develop and test without interfering with production systems.
- Looking at deploying Apache Traffic Server to help alleviate the troubled BugZilla instances and possibly JIRA.

Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming months. We are aware of a few outstanding mail-search requests that we have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals
- Further explore the MATT experiment (see General Activity)

General Activity:
=================
- Finished the initial design of git-dual.apache.org, intended for distributed git repositories (Whimsy experiment). Things have stayed fully in sync so far. We made some discoveries and hit some gotchas in setting this up, such as split-brain issues and canonical-source requirements for the setup to sync properly - in particular, it seems that the 'origin' setting in each repository must be set to GitHub for it to sync properly both ways. There is still the issue of a discrepancy between emails for commits pushed to ASF and commits pushed to GitHub, but this is being worked on.
- Adding new members to our GitHub organisation has been fully automated and tied to LDAP. Any committer setting their githubUsername field through id.apache.org will automatically be added as a member of our organisation there.
We are pleased to say this was the final step in fully automating the MATT experiment from the committers' side of things. All adding/removing of members for GitHub repos is now completely automated, if delayed by a few hours due to rate limits (in anticipation of 1000+ repositories, we have decided to do slow updates).
- Added JIRA SLA guidelines (as well as a status page on status.apache.org, see below)
- We had a bad disk on coeus, our central database server, causing many services to be slow while we replaced the disk and resilvered the mirror over the weekend.

Ticket Response and Resolution Targets:
=======================================
We have added a new SLA for tickets. We are still tweaking the parameters in this SLA, but the preliminary ones have been applied to our new JIRA SLA page, where stats for the current reporting cycle can be found at https://status.apache.org/sla/jira/?cycle=2016-03

Tickets created on or after February 23rd are counted towards our new SLA. We have a tentative goal of having at least 90% of all tickets fully handled (responded to and resolved) in time.

Quick March reporting cycle summary:
- 178 tickets opened
- 216 tickets resolved
- 144 tickets applied towards our SLA (Feb 23rd onwards)
- Average time till first response: 7 hours
- Average time till resolution: 17 hours
- Tickets fully handled in time: 95% (118/124)

Uptime Statistics:
==================
Nothing out of the ordinary to report here (99.9% across the board). We decommissioned the use of minotaur as a web space provider and have moved people.apache.org to home.apache.org. This has caused a slight rise in uptime for standard services, as the old people.apache.org was frequently experiencing issues. BugZilla has seen less abuse than before, possibly because of the measures taken previously. We are however exploring utilizing Apache Traffic Server to further ensure its future stability.
The Moin wiki is, as always, experiencing hiccups caused by the general flaw in its design. We urge any project still using it to switch to cwiki if they experience slowness in the service.

For detailed statistics, see http://status.apache.org/sla/
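Figures like the 95% (118/124) above fall out of a simple per-ticket check against the response and resolution targets. A hedged sketch of that calculation - the 24h/72h targets below are placeholders, not Infra's actual SLA parameters:

```python
RESPONSE_TARGET_H = 24    # placeholder target, not the real SLA parameter
RESOLUTION_TARGET_H = 72  # placeholder target

def handled_in_time(tickets):
    """tickets: list of (hours_to_first_response, hours_to_resolution) pairs.
    A ticket counts as 'fully handled' only if BOTH targets are met."""
    ok = sum(1 for resp, res in tickets
             if resp <= RESPONSE_TARGET_H and res <= RESOLUTION_TARGET_H)
    return ok, len(tickets), round(100 * ok / len(tickets))
```

The same pass over the cycle's tickets also yields the average-response and average-resolution figures quoted in these reports.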
New Karma:
==========
N/A

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
- Setting up a new git repository server at the ASF for testing git r/w synchronization viability.
- Provision an isolated test environment (including isolated LDAP) for developing/testing services faster than is currently possible, under a separate domain (asfplayground.org). We believe this separation will also help projects and volunteers work on their ideas, such as the Syncope trial, as they can then develop and test without interfering with production systems.
- Agreement with Apple has been executed, and projects may now publish apps in the iTunes store. Huge thanks to Mark Thomas, who has been working on this for multiple years and seen it to conclusion.
- Appveyor CI is now available for projects who make use of a GitHub mirror; for more details, see: https://blogs.apache.org/infra/entry/appveyor_ci_now_available_for

Long Range Priorities:
======================
- Mailing list system switchover is expected to happen in the coming months. We are aware of a few outstanding mail-search requests that we have concluded could be solved out-of-the-box by this.
- VMs on Eirene, Nyx and Erebus to be moved in readiness for their decommission
- Further explore identity management proposals

General Activity:
=================
- Moved the STeVe VM to a new data center in anticipation of the upcoming annual members meeting. Benchmarks show it could serve an election with more than 6,000 voters, so it should be adequate for the upcoming election.
- The new machine for serving up the ASF's end of the ASF->GitHub r/w repositories is under work, but has been slowed a bit by some difficulty in integrating new hooks, as well as difficulty with our current environment setup (see SRPs above).
As stated in the SRPs, we have discussed deploying a separate test environment that is isolated from our production environment and would allow for faster development/testing. The project hinges on three phases: ACL, email/webhooks, and dual r/w repositories with sync. The ACL has now been automated and is working. The email phase has turned up some faults in GitHub's way of determining whether a commit is unique (and thus deserves a diff email), so we are working towards replacing this with a new mechanism. This new method, however, is dependent on getting the 2x r/w repos up and running, which in turn depends on rewriting the git hooks we have in place for our current setup, and is the cause of the current slow progress. We expect to have a basic workflow with new hooks in the coming week, and once that is confirmed, we'll attach the Whimsy code to that process.
- Work continues on moving machines from our old VM boxes to our new provider.
- Fixed some issues with buildbot configurations not being applied.
- Lots of JIRA activity: 175 tickets created, 146 resolved
- Fixed the spam filtering cluster so that it runs checks correctly and is able to catch more actual spam. This has already proven very effective.
- 5 machines and two arrays decommissioned and removed from OSUOSL racks.

Uptime Statistics:
==================
The new Whimsy VM has implemented a status page that we have tied into our monitoring system (PMB). This is still not considered a production environment, and thus does not count towards our SLA.

This month was a slightly bumpy ride for some services. While the critical services continued their nice trend of pretty much 100% uptime, some core services such as people.apache.org have been very unstable, indicating that we may need to speed up the decommissioning of this service. BugZilla has been abused by some external services trying to scrape everything, causing some downtime.
We have implemented a connection rate limiting module (using mod_lua) for the BZ proxy, and this seems to have helped, albeit not stopped the downtime completely. The Moin wiki is, as always, experiencing hiccups caused by the general flaw in its design. We urge any project still using it to switch to cwiki if they experience slowness in the service.

For detailed statistics, see http://status.apache.org/sla/
Things were quieter this month than normal, largely due to holidays.

General Activity:
=================
- Added new nodes to our LDAP cluster for more robustness; started work on adding LDAP load balancers to prevent abuse of specific nodes.
- Work continues on moving machines from our old VM boxes to our new provider, as well as formalizing existing setups (config mgmt etc).
- GitHub organisational changes were applied, in order to continue with the exploratory MATT project. Committers can now be automatically added and/or removed from GitHub teams depending on LDAP affiliation and MFA settings.
- 120 JIRA tickets created, 156 resolved

Uptime Statistics:
==================
Continuing the positive trend since November, we are very pleased to report that uptime across all SLA segments for this reporting cycle was above the famous 'three nines', and the overall uptime - when using the same number of decimal places as most service providers - was at a marvelous 100.0%. That is not to say we did not have our share of service glitches - in fact we had quite a few - but the duration of these (a few minutes each) was too short to budge the total uptime figure.

In order to compare better with places using various decimal place settings, we have tweaked our SLA page to accept this setting as an argument; thus, http://status.apache.org/sla/?1 will show uptime as XX.Y% whereas http://status.apache.org/sla/?3 will show it as XX.YYY%, etc.

There are still places that we cannot or do not yet monitor - more specifically, various components in Whimsy, mostly due to access restrictions - and we are in talks with people from the Whimsy project about utilizing a new status page for our monitoring.

Whimsy Github Experiment
========================
We've done some work around MATT (Merge All The Things - http://matt.apache.org) and are relatively happy with the workflow around definitively identifying Github and Apache accounts and merging them.
Our process allows for folks to sign into their Github account and their Apache account and confirm the other. That work has been followed up with some automation work so that we can create groups on the fly, automagically populate them from group membership in LDAP (predicated upon the person identifying their Github ID with MATT), and then grant them commit access if they have MFA (multi-factor authentication) enabled.

We've decided, at least for the time being, to mandate MFA - as we have no visibility into authn failures or other auth attacks, and the MFA ability grants us better security than we have on our own hosted repositories. (Currently ~1/2 of Whimsy committers do not have commit access to the repository because they have not enabled MFA.)

We still have work to do on the Github experiment, mainly around the automation of pushes back to an ASF copy of the repositories. We should have more to report on this front next month.
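The group-sync logic described above boils down to set arithmetic over three inputs: who is in the LDAP group, who currently has access on Github, and who has MFA enabled. A minimal sketch of that reconciliation (an assumed shape, not the actual MATT code):

```python
def reconcile(ldap_members, github_members, mfa_enabled):
    """Each argument is a set of Github usernames (already tied to ASF ids
    via the MATT confirmation step). Only people who are in the LDAP group
    *and* have MFA enabled may hold commit access."""
    desired = ldap_members & mfa_enabled
    to_add = desired - github_members
    to_remove = github_members - desired  # left the group, or disabled MFA
    return to_add, to_remove
```

Under this model, the ~1/2 of Whimsy committers without MFA simply never enter the `desired` set, which matches the access behavior described above.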
New Karma:
==========
None this month

Finances:
==========
N/A

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
- The M.A.T.T (Merge All the Things) project for Whimsy is progressing, albeit hindered a bit by the need to change our entire setup on GitHub. We expect this to be solved within the next reporting period.
- We discovered an error in one of the monitoring systems we use (a faulty configuration) which had been preventing it from notifying us of some issues, such as a bad disk. This has been rectified, and hardware replacement should happen before the board meeting.
- We are in the middle of moving a lot of VMs away from our old vcenter setup to the new infra provider, LeaseWeb, as well as upgrading them, which will clear out a lot of technical debt.

Long Range Priorities:
======================
The mailing system switchover is scheduled to happen within the coming months, with tests being performed at the moment to port lists from the old ezmlm system to the new mailman 3 system. We are preparing to stress-/scale-test the new setup.

We are looking into replacing our current backup plans with more affordable ones while still retaining a hardware-managed solution. This will hopefully cut our backup costs in half in the long run.

General Activity:
=================
* Lack of responsiveness

We've noted that our responsiveness to email on infrastructure@ has suffered of late, resulting in dropped work items and failing to meet expectations. While we certainly don't want to manage work via email (far too much of it), it's clear that a number of issues were being lost in the sea of email. We've reinforced that the responsibility of ensuring we are responsive to emails falls to the on-call staff member of the week, and hope to see this situation improve.
* people.apache.org moving to home.apache.org

There has been some discussion about the decision to decommission the current people.apache.org server and move to a new machine. Most discussions have revolved around specific methods of access (rsync, scp, sftp), and some people have asked why the old data was not copied over verbatim. We are looking into whether rsync and scp access can become a reality - so far, our search has shown it's not an easy task to couple those with LDAP. We are not actively looking into copying everything from minotaur (the old host), as it would fill up most of the disk space with unnecessary files (very low signal to noise ratio).

* code signing service

Discussion has begun about discontinuing the code signing service. Despite a large number of projects requesting it, to date only two projects are making use of the service. Moreover, in the past year, we've had a grand total of 34 signing events. A conversation began on infrastructure@ about whether offering a code signing service remains pragmatic with such low uptake. Symantec was notified that we plan to discontinue the service, and has asked to be allowed to submit a less expensive option for our consideration.

Uptime Statistics:
==================
The November-December period has been extremely smooth sailing, with uptime across the board reaching the fabled 'three nines' (>=99.90%) and uptime for critical services nearly hitting 100%.

For more in-depth detail, please see https://status.apache.org/sla/
Operations Action Items:
========================
Infrastructure discovered that many projects were using git in an innovative manner. The majority of these uses bypassed the normal expectations that we had set about protected branches and tags. As a temporary measure, to get us back to the same level of assurance as expected, we disabled the ability to delete branches and tags. This has caused a bit of murmuring, as it is disruptive to the way that many projects use git. Infrastructure awaits guidance about policy around VCS, history, etc.

We've added a new cloud provider, Leaseweb, that should provide some additional capacity in the US and Europe.

Short Term Priorities:
======================
We've had an increased focus on tickets in the past month. Tickets are being created faster than we can deal with them when viewed from a monthly perspective. Hopefully, as we add additional capacity, we'll address this and get it back under control.

Long Range Priorities:
======================

Automation
----------
Automation hasn't been at the top of the list this month. Nonetheless, we've made some gains in this arena. We've added a good deal of the generic build slave and buildbot master configuration to puppet. We've also puppetized the new home.a.o service.

Resilience
----------
The past few months have given us a good opportunity to prove out our backups and ability to respawn infrastructure. We've gone through multiple moves of critical systems, most notably SVN, proving that our backups work as intended and that configuration management works well for those hosts.

Technical Debt
--------------
Web space for committers (currently hosted on people.apache.org) is being moved to a new home, aptly named home.apache.org, in the continued effort to phase out the minotaur server. Committers will be given 3 months to move their contents, after which people.apache.org will stop serving personal content and redirect to home.apache.org.
We have opted to ask people to individually move their content, due to the sheer amount of potentially unwanted old files that currently reside on minotaur (taking up 2TB of space). This will be a web hosting server only; shell access will not be allowed. We plan to set up a VM for PMC Chairs later on, for performing LDAP operations, but the free-for-all approach that minotaur has is unlikely to be offered in the future, due to its unmaintainability. Once this is in place, the only other major component we need to remove from minotaur is our DNS service, and we can then retire the aging machine. We will be publicizing the above very broadly in the coming weeks as we start the countdown clock.

Monitoring / Logging
--------------------
We have had to retire our initial unified logging cluster due to the incredibly high volume of logs coming in (estimated 20 billion entries per year), which was choking on disk read/write speeds first and foremost. A new 5-node cluster with faster SSD disks has been put in as a replacement, requiring less than 24 hours to set up and put into production thanks to our configuration management systems and some snappy work done by the team. This new cluster is henceforth known as 'Snappy'. We will most likely be cutting down retention time to 3-4 months as a precautionary measure, so as to only store somewhere around 5-6 billion records at any given time. As we mainly need "current" logs (within the last month) for our work, this is an acceptable compromise, considering the alternative would be more than doubling the cost of the cluster.

General Activity:
=================
On-boarding the newest member of the team is ongoing, and has proven to be a very smooth task, with every member of staff pitching in to provide guidance and help.
The mail archive PoC has been on hiatus, with the two lead designers being away on travel at various times, but as the systems have been running by themselves in the background, they have provided valuable debugging information for optimizing and further developing them. We remain convinced that we are on the right track in terms of software to use for the next generation of mailing lists, as we have not seen sufficient evidence of other alternatives operating at the same scale the ASF does.

Uptime Statistics:
==================
See http://status.apache.org/sla/ for details.

The critical services SLA was not met this month due to an LDAP outage that caused our "LDAP Sync" checks to fail for an extended period of time. While the LDAP service was not itself unavailable, we will be looking into better ways to ensure that we act faster when nodes go out of sync or stop functioning. The overall SLA was, however, met. In practice this had almost no effect on users, but it didn't meet our own internal standard.
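The log-retention estimate in the Monitoring / Logging section above can be sanity-checked with simple arithmetic: at roughly 20 billion entries per year, a 3-4 month window holds about 5-6.7 billion records, in line with the quoted "5-6 billion".

```python
ENTRIES_PER_YEAR = 20e9            # estimate quoted in the report
per_month = ENTRIES_PER_YEAR / 12  # roughly 1.67 billion entries per month
low, high = 3 * per_month, 4 * per_month
# Matches the quoted "somewhere around 5-6 billion records at any given time"
assert abs(low - 5.0e9) < 1.0 and abs(high - 6.67e9) < 0.01e9
```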
New Karma:
==========
Infrastructure has added Daniel Takamori as a part-time staff member. You can read more here: https://blogs.apache.org/infra/entry/dear_apache This is a backfill of the part-time contract position that expired in May of this year.

Finances:
==========
I spent some time talking with the Virtual folks about our rate of expenditure in some specific GL Accounts - those used for replacing hardware, cloud infrastructure, and build farm costs. While we are currently close to plan in terms of totals, the rate of spending has increased dramatically, and failing either a reduction in spending or in-kind sponsorships (one of which we are working on), we will be over budget in those specific categories. That said, we are significantly under budget overall.

Operations Action Items:
========================
N/A

Short Term Priorities:
======================
N/A

Long Range Priorities:
======================

Automation
----------
We've made small, regular strides in increasing our automation efforts, but nothing particularly report-worthy.

Resilience
----------
Much work around backups, improving both the scope and the quality of our backups, is ongoing. We are backing things up, but need to figure out a retention policy, and that work is yet to be done.

Technical Debt
--------------
We've made significant strides in migrating workloads off of aging machines. While we have a long way to go, progress is occurring. Most notably, by end of month we should have everything in place to retire the last remaining machine at our Florida colocation facility and cancel that contract.

Monitoring
----------
After last month's significant gains, we are seeing the benefits brought on by monitoring, but have not made notable additional gains in the monitoring itself.

General Activity:
=================
ApacheCon Europe happened in the past month, and a number of Infra volunteers and staff members attended.
We suffered a second LDAP outage, which required editing corrupted data and restoring it; the repair itself caused a brief interruption. Mail system proof of concept work continues, though the largest blocker at the moment is ensuring that a broader-than-the-ASF community cares about Ponymail.

Uptime Statistics:
==================
See http://status.apache.org/sla/ for details. Despite some needed LDAP maintenance that caused a brief outage, overall we met or exceeded all of the SLAs this month.
Operations Action Items:
========================

Short Term Priorities:
======================

DDoS
----
We experienced what we believe to be a DDoS attack against the mirror redirection CGI script this month. This drove our 15-minute load average to 2700+. We ended up deploying a much more efficient redirection script, and redirecting all queries from the old one to the new one to resolve the issue; it took us around 12 hours to mitigate the attack.

LDAP
----
During the reporting period we inadvertently lost an LDAP server, and this caused a number of services to cease being useful. However, our alerting detected the problem in a timely manner, and thanks to our resilient architecture and configuration management, we were able to provision a new host and have it working again within 12 minutes. A 12-minute mean time to recovery is a stunning statistic.

Long Range Priorities:
======================

Automation
----------
See the monitoring section for details on how we've automated the blocking of abusive traffic.

Resilience
----------
We didn't add much in the way of resilience, but we have a great example of how our resilience allowed us to quickly recover from a failure. See the short term section above.

Technical Debt
--------------
See details of mail in the General Activity section.

Monitoring
----------
This month, a lot of investment in monitoring over the past quarter has come to fruition.

First - we have finally promoted centralized logging into production. This has given us tremendous additional insight thanks to the visualization and query tools that are now available. We did run into some scaling issues with the 'preferred logging tool' and were able to move to a very simple python-based log-ingester that works on both our new puppet-managed machines and our legacy machines. Once we had the data in place, and the ability to run analysis on the fly, we immediately saw a number of situations where our services were being abused.
Eventually, we determined that we could programmatically deal with a number of these issues, across all of our machines. To that end, we've now deployed a tool called blocky that, based on input from our logging system, automatically blocks IP addresses across our entire infrastructure. We have a catalog of how blocking this abusive behavior has dramatically reduced bandwidth usage; in one case, 30% of a server's total bandwidth was caused by abuse.

In addition to the technical benefits, we can also provide insight to projects and fundraising into how much traffic is visiting our web properties, or even a specific project's site, where that traffic is coming from, and what visitors are doing most often.

General Activity:
=================

Mail Phase 2
------------
Mail has been interesting: we went from a very promising POC to realizing at least one component would likely not be able to scale to match our historical load, much less be able to scale to the future. In general, while we don't currently see any blockers, we are hearing of troubling experiences from others. In response to that, we've developed a prototype of a replacement called Ponymail. It can certainly handle the load. Our plan is to move this software to the ASF and ensure that it can develop a community around it. We've already called attention to it with some similar organizations who are going through mailman3 POCs. We will not adopt this software unless a community of folks beyond ASF Infra cares about it and helps to develop it. We want to make sure we aren't replacing an aging system with additional technical debt that will come back to haunt us in 5-7 years.

Uptime Statistics:
==================
We met this month's service level expectations. http://status.apache.org/sla/
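The core of what blocky does - turning aggregated logs into block decisions - can be sketched in a few lines. This is an assumed simplification; the real tool's inputs, thresholds, and firewall integration are not shown here:

```python
from collections import Counter

def abusive_ips(request_ips, threshold):
    """request_ips: one source IP per logged request in the analysis window.
    Returns the set of IPs whose request count exceeds the threshold; a real
    deployment would then push these to every host's firewall rules."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > threshold}
```

Feeding a window's worth of access-log IPs through a function like this, with a sane threshold, flags exactly the kind of scraper described above - e.g. a single host responsible for 30% of a server's bandwidth.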
New Karma:
==========
None

Finances:
==========
LastPass: $168
Dinner meeting with Lance from OSUOSL: $65.50
Cloud Services: $4619

(The cloud services cost was somewhat inflated by the fact that we purchased reserved instances - roughly $1k was due to this. However, it reduces our long-term payout somewhat dramatically.)

Operations Action Items:
========================

Short Term Priorities:
======================
- Sort out mail archives not updating for certain lists
- Figure out and solve emails supposedly being denied from certain senders

Long Range Priorities:
======================
- Mailman/HyperKitty proof-of-concept, implement ASF account design into it
- Revisit unified logging with a new machine setup
- Fully deprecate the remaining few large non-config-mgmt boxes

General Activity:
=================

Mail archives
-------------
An issue has arisen where certain mailing list archives have not been updated with emails from August. The issue has been narrowed down to the mod_mbox list database not updating, despite the raw data being present on the archive servers. While the investigation is not complete, we have uncovered some related permission and shell environment issues, and expect to be able to solve this issue before the next report.

infra@ ML split
---------------
It has been suggested, and agreed upon, that the infrastructure mailing list be split into a development part and a commit part, so as not to burden people who are only interested in the former with the latter. This split is expected to be implemented in the coming weeks. The root@ address is not a list (but an alias) and will remain so.
We intend to keep the old addresses working, but forward them to the new list addresses. Also, as infra@ is currently privately archived, we will not make those existing archives public; going forward, however, the new lists will of course be publicly archived.

Mailman3/Hyperkitty update
--------------------------
Work has been progressing, and the most recent activity has been a discussion on how we will implement authentication and maintain the concept of /private-arch/ (where foundation members are entitled to interrogate any private mailing list). This feature is fairly unique to the ASF, and as such no mailing list provider has this capability natively. The PoC has been stood up and is accepting mails for a couple of test domains. As soon as the platform is ready for people to look at and test, further information will be shared.

Monitoring
----------
We're now using Datadog as a SaaS monitoring service. It's done a decent job of getting us a baseline of metrics. This has given us an increased level of visibility, at least into the systems that are managed by configuration management.

Uptime Statistics:
==================
Going forward, uptime statistics, as they relate to our SLAs, have been fully automated and can be found at: http://status.apache.org/sla/

While primarily done to save us 3-4 days of contractor time on these statistics every year, we also felt that there were no compelling reasons not to have this publicly available ahead of report time, as both our SLAs and the uptime data itself have (technically) been publicly available for a long time in raw form.
To sum up quickly, all service level goals have been reached for the past 5 reporting cycles:

---------------------------------------------------------------------------
Cycle:     | Critical (>=99.5%) | Core (>=99%) | Standard (>=95%) | Average
---------------------------------------------------------------------------
Mar - Apr  |       99.52%       |    99.81%    |      99.14%      | 99.58%
Apr - May  |       99.83%       |    99.96%    |      99.22%      | 99.75%
May - Jun  |       99.72%       |    99.67%    |      99.32%      | 99.59%
Jun - Jul  |       99.50%       |    99.13%    |      99.65%      | 99.34%
Jul - Aug  |       99.50%       |    99.87%    |      99.80%      | 99.77%
---------------------------------------------------------------------------

Contractor Details:
===================
New Karma:
==========
None

Finances:
==========
$2930 - Rackspace cloud
$60 - Dotster
$66 - Hetzner
$3336 - AWS

Operations Action Items:
========================
N/A

Long Range Priorities:
======================

Automation
----------
While the general spread of configuration management continues, the major increase has been with adding Confluence as a puppet-managed service. The Buildbot master and Jenkins master have started the move to being completely managed by configuration management.

Resilience
----------
This month we've reduced the number of disparate SSL proxies. We now have three identical SSL endpoints, all capable of serving the same content. With some upcoming work around GSLB, this should prepare us to have an additional level of resilience for the many services behind the proxies.

Monitoring
----------
We didn't make much progress on this front this month.

Technical Debt
--------------
Our GeoDNS instance failed this month. This service routes users to the closest web server, SVN deployment, or rsync host based on where they are in the world. The underlying host went offline and we were unable to resurrect it. We are awaiting assistance from smart hands in Europe to do so. In the meantime, we've disabled the service, which is shunting all users to a single instance of these services. We'll be moving this service to an external provider with substantial redundancy in the future.

General Activity:
=================

Bugbash
-------
Infrastructure has run two bugbashes in the past month. Over the course of both days, 39 bugs were resolved.

Buildbot Security Issue
-----------------------
As noted in the blog, we were alerted to some aberrant network traffic originating from our buildbot master. While we were able to fix the underlying issue, we also realized that an abundance of caution dictated that we should rebuild the machine. We were also approaching the EOL of the hardware. This led to the decision to take the pain and rebuild it.
As of this writing the post-mortem on the incident hasn't occurred, but
findings will be reported when it happens.

Jenkins
-------
Our Jenkins master has been suffering from disk I/O issues, and we opted
to change the underlying file system. While we were down for that
operation we took the opportunity to begin the process of puppetizing the
host. The restoration of data to the host took much longer than planned,
but the final outcome appears to be performing much better.

Uptime Statistics:
==================

Contractor Details:
===================

Gavin McDonald
- Oncall Duties: removed some snapshots on hades to free up some disk
  space. Due to the space issue, we are now keeping only around 1 month's
  worth of snapshots on all repositories.
- Upgrade Confluence Wiki to latest version - documentation created in
  cwiki
- Migrate Buildbot to a new buildbot-vm - documentation created in cwiki
- Begin work on and local testing of buildbot master module
- 66 Jira tickets worked on, closing 53 - some longstanding CMS-related
  tickets resolved.
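The GeoDNS routing described under Technical Debt above - send each user to the nearest host, or shunt everyone to a single instance while the geo service is disabled - can be sketched roughly as follows. This is a hypothetical illustration, not Infra's actual code; the region names and mirror hostnames are invented.

```python
# Hypothetical sketch of GeoDNS-style routing with a disabled-service
# fallback. All hostnames and regions below are invented for illustration.

MIRRORS = {
    "eu": "eu.mirror.example.org",
    "us": "us.mirror.example.org",
    "apac": "apac.mirror.example.org",
}
# Single instance that serves everyone while the geo service is down.
FALLBACK = "us.mirror.example.org"

def pick_mirror(client_region: str, geodns_enabled: bool = True) -> str:
    """Return the host a client should be sent to."""
    if not geodns_enabled:
        return FALLBACK  # shunt all users to one instance
    return MIRRORS.get(client_region, FALLBACK)
```

Moving the service to an external provider amounts to hosting the region lookup on redundant infrastructure so the fallback path is rarely needed.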
New Karma:
==========
n/a

Finances:
=========
$3150  - cloud related expenses
$17.49 - domain renewals

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

SSL
---
The issues reported last month seem to have been dealt with earlier in
the month.

SVN
---
Disk usage on our SVN host became an issue. We attempted a service
migration to a cloud-based host with more storage. We migrated the
service, then discovered problems that weren't found in testing, and
rolled back.

Long Range Priorities:
======================

Automation
----------
Moving to RTC for our configuration management code has been an
interesting exercise. It certainly hasn't found all of our issues, but it
is forcing cognizance of what folks are doing and how they are doing it.
In addition, we are seeing a number of issues fixed during review, before
they get pressed into service.

Resilience
----------
We made little progress on improving our resiliency this month.

Technical Debt
--------------
We made little progress on reducing our technical debt.

Monitoring
----------
This month has seen a sharp rise in false positives in monitoring. This
naturally adds to the workload of the oncall person, and is frustrating.
Work is ongoing to reduce the number of false positives that wake folks
up.

General Activity:
=================
As of June 1, Chris Lambertus and Geoffrey Corey are now employees within
the Virtual PEO.

Uptime Statistics:
==================
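One common way to cut the pager false positives described under Monitoring above is to require several consecutive failed checks before paging, so a single flaky probe doesn't wake anyone. This is a generic sketch of that debouncing idea, not Infra's actual monitoring code; the service name and threshold are illustrative assumptions.

```python
# Sketch (not actual Infra code): only page when a check has failed
# `threshold` consecutive times, and page only once per failure streak.

from collections import defaultdict

class Debouncer:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = defaultdict(int)  # consecutive failures per service

    def observe(self, service: str, ok: bool) -> bool:
        """Record a check result; return True when a page should fire."""
        if ok:
            self.failures[service] = 0  # success resets the streak
            return False
        self.failures[service] += 1
        # Fire exactly when the streak reaches the threshold, not after.
        return self.failures[service] == self.threshold

d = Debouncer(threshold=3)
checks = [False, False, True, False, False, False, False]
pages = [d.observe("ldap", ok) for ok in checks]
# Only the sixth check (third consecutive failure) triggers a page.
```

The trade-off is detection latency: a threshold of 3 on one-minute checks means a real outage pages roughly three minutes in.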
New Karma:
==========
n/a

Finances:
=========
$3013 - related to ApacheCon
$422  - hardware replacement
$2350 - Cloud expenses
$60   - Registrar fees

Operations Action Items:
========================
This month has seen work around onboarding the two US-based contractors
as employees. This involves background checks, reference checks, as well
as paperwork. That looks to be largely complete. It is likely that the
two US-based contractors will be employees effective June 1.

Short Term Priorities:
======================

Websites
--------
Git-based websites are now possible, and seem relatively popular. 8
projects are now using the gitwcsub service, and we are receiving ~2
requests per week.

SSL
---
SHA-1 based chains have been deprecated. Chrome and Windows now show
alerts for SHA-1 based certs, or certs with SHA-1 certs in the chain. Due
to this we spent a good amount of time swapping out SSL certs this month.
The switch hasn't been completely trouble free: some git binaries for Mac
or Windows seem to be having difficulty, and this problem continues to be
worked on.

Long Range Priorities:
======================

Automation
----------
This month has seen some automated testing enabled for our configuration
management repo. In addition, we've moved to RTC from CTR for most
changes - while this isn't final, the change at this point appears
positive.

Resilience
----------
The big boost this month in resilience comes in the mail infrastructure.
See the comments in the mail section.

Technical Debt
--------------

Monitoring
----------
While most of our monitoring has been working, we continue to have issues
on our legacy systems that leave us without an ability to monitor.
Additionally, some of our applications need deeper introspection for
specific functionality, and we have yet to cope with that.

General Activity:
=================

Mail
----
This month has seen the culmination of phase 1 of our mail overhaul.
Many thanks to the SpamAssassin community, and Kevin McGrail in
particular, for providing insight into what they see as best practice.
Phase one is focused on our MXes, and spam and virus processing. In some
ways, we've overbuilt the current deployment - our new architecture can
scale horizontally, allowing us to handle many times our current mail
load. Much work continues on this front, and more should be seen as
phase 2 materializes.

Uptime Statistics:
==================
Overall, the total uptime for 2015 increased by 0.02% this month, in what
has generally been a quiet month in terms of emergency maintenance and
downtime. The few services that performed badly this month have been
mentioned in earlier reports, and steps are being taken to increase the
availability of these services in the long run.

Type:               Target:  Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:  99.50%   99.62%           99.83%           Yes/No
Core services:      99.00%   99.88%           99.95%           Yes
Standard services:  95.00%   98.96%           98.91%           Yes
----------------------------------------------------------------------------
Overall:            98.59%   99.57%           99.64%           Yes
----------------------------------------------------------------------------

Contractor Details:
===================

Gavin McDonald
- Oncall Duties: Whimsy died 04/24 and needed a reboot. Crius disk space
  reached 75%.
- Some buildbot slaves offline; looked into the issues and brought them
  back online.
- Look into various projects' Buildbot failures and resolve, liaising
  with the projects as necessary.
- Reviewing and merging others' branches to deployment.
- Crius and Hemera were 100+ packages behind, 3/4 of which were security
  updates. Updated both, then enabled auto-patching of security updates.
- More work on blogs_vm module, with others checking and merging
- Setting up a more robust local puppet module testing environment
- Investigate and fix PagerDuty alerts for various services
- Investigate Moin Wiki ongoing issues and start to compile a report
- Reduce Crius disk space to 70%
- Clear space on analysis-vm (sonar)
- 57 Jira tickets worked on, closing 40
- Various cron mails looked into and resolved (mainly new cert errors)
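The SLA tiers in the uptime table above translate into concrete downtime budgets, which is often the easiest way to reason about them. A quick back-of-envelope sketch, assuming a 30-day reporting cycle:

```python
# Back-of-envelope downtime budgets for the SLA tiers reported above,
# assuming a 30-day reporting cycle (an approximation; real cycles vary).

MINUTES_PER_CYCLE = 30 * 24 * 60  # 43,200 minutes in 30 days

def downtime_budget(target_pct: float) -> float:
    """Minutes of downtime allowed per cycle at a given uptime target."""
    return MINUTES_PER_CYCLE * (100.0 - target_pct) / 100.0

budgets = {
    "critical (99.5%)": downtime_budget(99.5),   # 216 minutes (~3.6 hours)
    "core (99%)":       downtime_budget(99.0),   # 432 minutes (7.2 hours)
    "standard (95%)":   downtime_budget(95.0),   # 2160 minutes (1.5 days)
}
```

So a critical service can afford only about three and a half hours of total downtime per cycle, which is why single-host services are the recurring problem area in these reports.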
New Karma:
==========
None

Finances:
=========
$53   - domain renewals
$2165 - Cloud Services

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

LDAP
----
We've increased our spread of LDAP so that all of our cloud regions now
possess an LDAP server, allowing authentication to work over the local
network.

CMS
---
We've generated a FAQ to catalog issues and questions that came up during
the RFC. We hope to have some direction during or shortly after
ApacheCon.

Long Range Priorities:
======================

Automation
----------
We've been working on a number of automation efforts. One of our recent
deployments of LDAP has proven we can deploy a new host in less than 10
minutes, from provisioning to functional service. Additionally, we've
been working on moving more services. One of the highlights this month is
the STeVe deployment. We now have the service in a state where it is
trivial for us to deploy a fresh machine with a STeVe deployment for
projects to use.

Resilience
----------
We've begun moving VMs off of our internal VMware deployment. At the same
time we have been working on spinning up our VMware-based cloud
deployment at PhoenixNAP.

Technical Debt
--------------
We suspect (but are unable to prove) that some of our VMware issues are
related to using EOL/EOS software from VMware that requires us to run the
deployment for at least one machine in a suboptimal manner.

Monitoring
----------
While monitoring has been useful in identifying issues, we haven't
materially expanded its use this month, something we hope to remedy soon.

General Activity:
=================

Incidents
---------
We had a total of 98 incidents in the month of March that we alerted on
and paged the on-call contractor for. The relatively high number has
caused some concern, and we are tracking ways to reduce the number of
alerted incidents to the truly severe.
http://s.apache.org/WjC
http://s.apache.org/jQs

Jira
----
A number of Jira imports have successfully been handled. These include
the plethora of Maven project imports, Tinkerpop, and Groovy, among
others. The bulk of this work has been done by Mark Thomas, who deserves
special thanks for tackling it.

VMware hosts
------------
One of our VMware hosts has been suffering from intermittent network
failures, as well as the entire machine dying. This originally appeared
to be time related. We've begun migrating services off of this host in
order of priority, but the outages have affected a number of important
services, including the git services as well as blogs.

Uptime Statistics:
==================

Contractor Details:
===================

Gavin McDonald
--------------
- Engage with OpenOffice community regarding a few Infrastructure-related
  issues, such as the Mac Buildbots
- Revisit and engage with CouchDB PMC to sort out 2 outstanding domain
  name transfers. Sorted out couch.[org|com] domains with Dotster;
  domains are successfully transferred. Next step is to sort out
  secondary DNS and httpd configs.
- Covered on-call whilst folks were travelling to ACNA
- Jira Tickets: 68 worked on and 56 completed as of 04/20/15
- Confluence Wiki migration was completed.
- Began work on moving the Blogs service to Puppet 3 and a new home in
  the cloud.
- Began work on preparing to migrate TLP playgrounds to Puppet 3 and the
  cloud.
- Prepared a draft policy covering VCS canonical locations.
- On 4/17 qmail stopped sending mails and the queue went above 100000.
  Resolved the issues and the service is running normally again.
- commonsrdf CMS setup refuses to create a staging site. Looked
  extensively into the problem, but it remains unresolved at this time.
  This is a blocker for the project, as they have no website and are
  waiting on it to do a release. See: INFRA-9260
New Karma:
==========
N/A

Finances:
=========
$135.40  - Hetzner.de
$285.00  - Silicon Mechanics
$1267.11 - Amazon Web Services
$17.49   - Dotster

Operations Action Items:
========================
N/A

Short Term Priorities:
======================

Codesigning
-----------
There have been several signing events in the past month: 4 signing
events for Tomcat 8.0.19 and 8.0.20, and 1 signing event for OpenMeetings
3.0.4.

Machine deprecation
-------------------
We continue to make progress in moving services off of some of our oldest
hosts.

VMware issues
-------------
We experienced repeated network failures on one of our newer VMware
hosts. This took down most or all services on the host repeatedly. The
failures appear to be related to time issues, and seem to have settled
down, though we continue to watch the host closely.

Backups
-------
The new backup service is moving along well. 7 of our hosts are currently
backed up by the new service. Our goal is to have all of the hosts backed
up by the end of the year, then deprecate our Florida colo and the
machine running there.

Long Range Priorities:
======================

Automation
----------
We've made good progress on a number of automation fronts. The work to
automate Confluence has been done, and the service will move hosts in the
coming weeks. A number of our SSL endpoints have now been puppetized, as
have services like asfbot, svngit2jira, and gitpubsub.

Resilience
----------
We haven't made a lot of progress on this front in the past month, aside
from the ongoing automation efforts.

Technical Debt
--------------
We've run into a bit of technical debt surrounding CGI pages. This had a
deleterious effect on a number of projects. Several pieces of debt were
discovered here. The first was that a contractor had been manually
setting the executable bit across hosts. The second was that the CMS was
not picking up and communicating permission changes, even when set
appropriately.
There had been a custom CGI module written in the past, and we initially
spent a large amount of time, to much frustration, trying to get it to
work in our new environment. In the end, we decided that bearing
responsibility for maintaining a custom module was more technical debt
than we wanted to amass, and we instead set the executable bit for a
number of projects using svnadmin karma. We believe these issues to now
be resolved.

Monitoring
----------
In the past month we've uncovered a few holes in our monitoring that have
resulted in users pointing out problems to us. We are still working to
fill those holes.

General Activity:
=================

CMS
---
As mentioned in last month's report, we are struggling to find a solution
for the CMS, which is on an aging physical host. We've sent a request for
comments on some of our proposed solutions to PMCs and operations@:
http://s.apache.org/iL4

Uptime Statistics:
==================
Overall, the total uptime for 2015 fell by 0.05% this month, mostly
attributable to some issues with one of the VMware hosts, as well as
people.apache.org growing slightly more unstable (though nothing
alarming). We currently have two services with uptime below the SLA
(people.apache.org with 98.83% and the MoinMoin wiki with 93.31%), while
the remaining 26 in the uptime sample are above the SLA. The bad
performance of the wiki is due to the fact that MoinMoin was not designed
to scale well, resulting in a lot of >30s response times - not errors per
se, but still counted against its uptime.
Type:               Target:  Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:  99.50%   99.65%           99.44%           Yes/No
Core services:      99.00%   99.83%           99.84%           Yes
Standard services:  95.00%   98.95%           98.82%           Yes
----------------------------------------------------------------------------
Overall:            98.59%   99.55%           99.47%           Yes
----------------------------------------------------------------------------

Contractor Details:
===================

Daniel Gruno:
Non-JIRA activities:
- Moved (and puppetized) services from urd (baldr) to a new host:
  - gitpubsub
  - svngit2jira
  - asfbot
- Weaved reporter.apache.org into Marvin
- Helped debug and fix issues with Marvin not working
- Fixed some issues with the URL shortener (bad regex)
- Updated documentation for Git services (git-wip + mirrors)
- Fixed API (3rd party change) for our status page
- Fixed an issue with the project services monitoring not firing off
  alerts
- Set up a replacement for aurora for www-sites.
- Experimented with the CouchDB project on a new gitwcsub service for web
  sites, allowing projects to use git for their web sites instead of svn
- On-call duties

Geoffrey Corey:
- Resolved 31 JIRA tickets
- Fix mail relaying from hosts not in OSU network (bugzilla host)
- Work with AOO to get back missing buttons/options in their bugzilla
- Work on backlog of git-related JIRA tickets
- Work with GitHub support to turn back on the mirroring service
- Create mysql instances in PhoenixNAP for new VMs there
- Migrate pkgrepo to bintray (still need to figure out GPG signing in a
  proper way)
- Work with Gavin on the Puppet 3 confluence wiki module and its
  dependencies
- Create puppet manifest for erebus-ssl terminator/proxy rebuild and
  finish up testing for it
- On-call related duties

Chris Lambertus:
- Time off due to personal/child care
- On-call duties
- General assistance to other contractors
- Troubleshooting and diagnostics for ongoing lucene git issues
- Troubleshooting and diagnostics for ongoing eirene VM host issues
- Completed base functionality of zmanda project
- Resolved issues with zmanda+vmware connectivity
- Resolved issues with abi reaching storage limits (added monitoring)
- Extensive tuning of zmanda system

Gavin McDonald:
- Worked on 30 tickets, closing 23
- More Confluence wiki puppetisation
- Migration work of Confluence to a new home (99% complete)
- Various other general support, looking at VM issues
New Karma:
==========
none

Finances:
=========
$17.49  - Domain name renewals
$979.12 - Amazon Web Services

As a side note, we expect to spend dramatically less on hardware than we
originally budgeted. This difference comes about for a number of reasons:
our build farm has been dramatically subsidized thanks to Yahoo, who have
provided ~30 physical machines (along with hosting the hardware and
providing smarthands support); we also now have multiple cloud providers
giving us extensive credits; and we are moving a number of services to
public cloud providers. This does mean a slight shift from capital
expenses to operational expenses, though from our perspective it doesn't
matter much.

Operations Action Items:
========================
n/a

Short Term Priorities:
======================

Codesigning
-----------
Tomcat generated two signing events this month, both for Tomcat 8.0.18.

Machine deprecation
-------------------
We've made significant progress in moving services off of some of our
oldest hosts. In doing so we've also spent a good chunk of time
automating these services and making our deployment more robust. See more
on this issue in the section on Automation in Long Range Priorities.

Backups
-------
We are just now beginning the deployment of a new centralized backup
service. The client installation as well as the server has been
automated; expect to see more in this space in the coming months.

LDAP
----
The LDAP service has largely been rebuilt from the ground up. For
background: the old machine that formerly served as our svn master was
also one of our LDAP machines, and when it failed, we were down to one
very poorly performing instance in the US. Subsequently, we've rebuilt
the entire service using configuration management, and we again have two
LDAP hosts in OSUOSL that are easily handling the load. This has sped up
many of the authn/authz actions that were slow last month.
We've also deployed new LDAP instances to several of our cloud zones.
While the service seems to be working well at this point, we did have a
few hiccups, where the old LDAP servers were removing newly created
accounts. The older LDAP servers had been left in place after we
repeatedly found a number of services that had either minotaur or
harmonia hardcoded as LDAP servers. Because of these problems we removed
the legacy instances, and are dealing with the resulting problems as we
find them.

Bugzilla
--------
Our three Bugzilla instances were still running on a 6 year old machine,
but have since been successfully puppetized and migrated to VMs in one of
our cloud accounts. In the process we've worked fastidiously on improving
the software deployment mechanism (software is now deployed as an
OS-native package).

Long Range Priorities:
======================

Automation
----------
Significant progress to report this month. As indicated above, the LDAP
machines are all under configuration management. Additionally, we
migrated all three of the Bugzilla instances and the git repositories to
being completely managed by configuration management. After finding some
problems with some of our CM-managed instances, we've adopted a process
of destroying and recreating a service as a verification step prior to
pressing a service into production status. Naturally, the new services we
are bringing online, like the backup service, are all managed under CM.

Technical Debt
--------------
The move of LDAP has uncovered a lot of hardcoded values, and we've been
working to pay that debt off (referring to service names rather than
specific machine identities, and putting configuration in configuration
management where possible).

General Activity:
=================

Bintray
-------
As of January 30th, we enabled Cassandra's debian repository on bintray
and have been monitoring it closely.
The service seems to be working well and appears to have fulfilled the
goal of reducing our overall webserver traffic, with the bonus of giving
more insight into the downloads. You can see some of the statistics on
the dashboards here:

https://s.apache.org/bintray1
https://s.apache.org/bintray2

Maven
-----
A lot has been happening around Maven this period. We've successfully
been able to sync a copy of the Maven central repository, and are working
to provide access to that store for the Maven PMC. Additionally, Mark
Thomas, along with folks from the Maven PMC, has been working on
migrating the Maven contents of the Codehaus Jira instance to the ASF
instance. Good progress has been made here, but much remains to be done.

Uptime Statistics:
==================
We experienced an issue where status.apache.org was reporting erroneous
uptime statistics due to two unused LDAP checks, which unfortunately made
it into the weekly ASF blog posts. Other than that, services have been
running fairly smoothly, aside from the MoinMoin wiki, which has
experienced high load times for a couple of weeks. We are discussing what
to do to remedy this.
Overall, the total uptime for this year grew by 0.03%:

Type:               Target:  Reality, total:  Reality, month:  Target Met:
----------------------------------------------------------------------------
Critical services:  99.50%   99.78%           99.82%           Yes
Core services:      99.00%   99.83%           99.92%           Yes
Standard services:  95.00%   99.02%           98.93%           Yes
----------------------------------------------------------------------------
Overall:            98.59%   99.60%           99.63%           Yes
----------------------------------------------------------------------------

Contractor Details:
===================

Chris Lambertus:
- On-call duties
- Closed 3 jira issues
- Extensive work on centralized backup deployment
- Ongoing testing and validation of zmanda evaluation
- Puppet work to build a fully configuration-management-deployed host
- Resolved hardware issues with oceanus/FUB/Dell Germany
- Ubuntu libc vulnerability patching

Geoffrey Corey:
- Resolved 25 JIRA tickets
- Build out supporting environment in Puppet for bugzilla migration (sql
  database, webserver/proxy, bugzilla package building, etc)
- Migrate and deploy bugzilla instances off baldr and into VMs
- Investigate with others about missing LDAP accounts (and subsequently
  recreate them) after LDAP server rebuilds
- Fix dist.apache.org authorization template regeneration (related to
  svn master rebuild)
- Work on getting postfix alias management in Puppet
- Begin learning buildbot related things from Gavin

Daniel Gruno:
- Resolved 30 JIRA tickets
- On-call duties
- Worked on GitWcSub, the git version of SvnWcSub, for potentially
  enabling git repos to act as web site sources
- Puppetized and tested deployment of GitWcSub
- Fixed a bunch of issues with the main MTA
- Worked with others to resolve the aftermath of the LDAP network
  redesign
- Fixed some issues with uptime reporting and alert statuses on
  status.apache.org
- Fixed some issues with hardcoded values in the PMC management tools
- Reached out to contacts about the DNS system overhaul

Tony Stevenson:
A lot of my time has been spent on 3 major tasks:

- The rebuild of the LDAP service, due to the poorly performing incumbent
  instances. Combined with the retirement of eris (the old svn master)
  following a terminal hardware fault, we were limited to 1 LDAP instance
  in the US, which proved too much for minotaur to cope with. It appears
  that the version of slapd on FreeBSD on minotaur leaked memory at a
  phenomenal rate. The new LDAP service has been moved to the latest
  slapd available in Ubuntu 14.04; it has been fully built using puppet,
  and configured so that new LDAP hosts can be added with ease.
- Continued preparation and understanding for a rebuild of the email
  infrastructure. This has mostly taken a back seat, but has been brought
  up to the top of my todo list now, following the completion of the LDAP
  task above and the CMS task below.
- In line with our current policy of retiring hardware that is over 4
  years old, and trying to make all services puppet managed, David asked
  me to review the CMS zone on baldr (which is now >7 years old) to
  ensure we can move the service and have it managed with puppet.
  However, upon investigation it quickly became apparent that moving the
  service was far from trivial, given the requirement for ZFS alone.
  Further reading of the code highlighted some areas of concern for me
  that I felt needed highlighting, as they would likely carry technical
  debt over into the future, and that is something we are working
  extremely hard to remove. On the back of these findings, and my
  inherited knowledge of the service, I presented David with 3 options
  for how I thought we could manage the service going forward, along with
  the estimated costs and the pros and cons of each option. These were:

  1 - Move CMS to another FreeBSD host - undesirable given our current
      trajectory of moving away from FreeBSD, and it meant entirely
      replicating the FreeBSD 9 jail too. This was the path of least
      change, but perhaps most difficult.
  2 - Move the service to Ubuntu, fully puppet controlled from the
      beginning.
This might have been the ideal scenario, but given the underlying
hardcoded FreeBSD aspects, the need for ZFS (which is not present in
Ubuntu), and the very specific perl that is in place, I felt this would
take a long time to complete and would be significantly error prone. We
would very likely miss something, and this would need to be fixed on
demand. My confidence level was low that we could execute a clean
migration. This would be the most time consuming option but, if retention
of the CMS is important, the best option.

  3 - Deprecate the CMS, and allow projects to determine their own
      publishing/transformation options. We would still require use of
      pubsub technology, but projects could commit into that their HTML,
      either directly edited or derived from markdown which they can keep
      in their project repo (some of the finer details will need to be
      worked out later). This would essentially be the cheapest option
      and remove technical debt, and while the timeline would be many
      months, contractor/volunteer time could be kept to a minimum.
New Karma:
==========

Finances:
=========
$1429.14 - Amazon Web Services
$3969.00 - Carbonite/Zmanda

As a side note, the new cards have arrived, which provide a much better
level of insight into spending; many thanks to the office of Treasurer
and the EA for chasing this.

Operations Action Items:
========================
N/A

Short Term Priorities:
======================

Codesigning
-----------
Another project has requested code signing functionality (UIMA,
INFRA-9002). Four signing events occurred in the month: two events each
for Tomcat 8.0.16 and 8.0.17.

Machine deprecation
-------------------
Work continues (though slightly hindered by the holidays) on deprecating
the host that runs the writable git service and bugzilla.

Backups
-------
Over the past several months we've found a number of services where
backups were either failing or not happening at all. We've spent a good
deal of time auditing backups and looking for a new solution that gives
us better visibility into the success or failure of backup jobs. To that
end, we've selected Zmanda as our platform of choice and have begun
deploying it.

LDAP
----
LDAP has emerged as a priority during this month. The loss of the machine
that served as the svn master last month reduced the number of LDAP
servers in our Oregon colo to 1, and that instance is consistently under
heavy load; logins to most services are taking significantly longer as a
result. We've had a lot of work in process, and only recently began
tackling the issue.

Long Range Priorities:
======================

Monitoring
----------
Work is continuing on monitoring, with a good leap forward this month. We
are taking advantage of (and contributing to) a project by the name of
dsnmp that queries the Dell OpenManage SNMP/WBEM frameworks, as well as
the overall operating system health of a machine. This provides status
checking to alert us to issues that have frequently resulted in outages
or service degradation.
This month we were alerted to multiple issues that we were able to
address before they resulted in outages. This is not a panacea, nor are
we done with monitoring efforts, but we are in a much better position
now.

Automation
----------
Automation progress continued, though slowed somewhat by the holidays.
The writable git service is now handled by configuration management. The
following services are currently in progress or in final stages of
testing:

- blogs.a.o
- Bugzilla

Resilience
----------
We haven't made much progress on this front in the past month, aside from
the ongoing automation efforts.

Technical Debt
--------------
As part of our automation efforts, we've been able to generate
reproducible Debian packages for our customized version of Bugzilla. The
long term plan is to have a private build job that builds new packages
any time the source code in our tree for Bugzilla is modified.

General Activity:
=================
We continue to explore the package repository service, and have folks
from Cassandra currently working on moving their existing deb repository
from www.a.o/dist to this service as a pilot. Traffic about the pilot has
generated interest from other projects.

Addressing a long-standing todo that came out of the Heartbleed
vulnerability, we now have an enterprise account with Symantec that will
allow us to provision certs on demand, with no interaction required, for
apache.org and openoffice.org.

We've migrated a traffic-intensive service from one provider to another
to minimize our cloud hosting expenses. (Currently 3/4 of our cloud
infrastructure expense is from egress traffic.) The new provider has a
different fee structure that should result in noticeably lower service
charges.

Maven has requested, and we've agreed to host, an ASF copy of the Maven
Central repository from Sonatype. Work is starting around that, but is
still early. This is expected to cost in the range of $600 per annum.
We began discussing deprecating translate.apache.org, as the service is used by only 4 projects and has only one volunteer administering it. Additionally, a number of l10n services advertise free l10n hosting for OSS projects, and many of our projects already make use of those free offerings.

Uptime Statistics:
==================

We have revamped our uptime charts a bit, adding some services and removing some deprecated ones. Our current overall target is 98.59% for these samples. This month, the overall uptime was 99.61%, with critical services achieving 99.81%. All of the downtime here was due to moving the writeable git repos to a new machine.

Type:                Target:    Reality:    Target Met:
---------------------------------------------------------
Critical services:   99.50%     99.81%      Yes
Core services:       99.00%     99.66%      Yes
Standard services:   95.00%     99.18%      Yes
---------------------------------------------------------
Overall:             98.59%     99.57%      Yes
---------------------------------------------------------

Contractor Details:
===================

Geoffrey Corey:
- Resolved 26 Jira tickets
- Cleaned up some TLP-server-related puppet modules to require no input and to make deployment a one-step process (this also allows svnwcsub use for services such as status.apache.org)
- Added logic to puppet that deploys Dell OMSA to physical Dell hosts for monitoring
- Fixed svnpubsub not updating www.apache.org/dist entries (related to the svn master rebuild)
- Coordinated with OSUOSL to replace a disk in Arcas
- Researched fpm to build an ASF Bugzilla Debian package whenever the source tree changes
- Created a Bugzilla puppet module to deploy the ASF's different Bugzilla instances
- Completed TLP graduation for Falcon and Flink
- Various on-call duties

Chris Lambertus:
- Resolved 2 Jira tickets
- On call (xmas)
- Resolved a number of hardware issues with erebus (bad DIMM)
- Installed and coordinated restarts for OMSA on Eirene
- Cleaned up the collectd configuration in Puppet to apply collectd to any system
- Deployed new status.a.o at RAX
- Installation, configuration and evaluation of Zmanda and other backup tools
- Initial documentation of the Zmanda license count and a tally of existing storage
- Oceanus troubleshooting and coordination with Dell/Dell Germany to get the warranty location updated and parts shipped to the right place (FUB)
- Initial ESXi configuration to enable WBEM monitoring (eirene)

Gavin McDonald:
- Worked on 53 tickets, closing 34
- Work on puppetising TLP VMs and Blogs

Daniel Gruno:
- Assisted Tony in moving the writeable git repos
- Orchestrated the move of projects.apache.org from Infra to ComDev
- Improved monitoring of ZFS pools
- On-call duties
- Ongoing discussions on the svn redundancy setup, corporate offers and DNS setup
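The uptime figures above can be derived from raw monitoring samples. A minimal sketch of that calculation, assuming targets taken from the table; the sample counts below are invented for illustration:

```python
# Hypothetical sketch of deriving per-class uptime percentages like those
# in the table above from monitoring samples. The targets come from the
# table; the sample counts are invented for illustration.

TARGETS = {"critical": 99.50, "core": 99.00, "standard": 95.00}

def uptime_pct(up_samples, total_samples):
    """Percentage of samples in which the service responded."""
    return 100.0 * up_samples / total_samples

def report(samples):
    """samples maps class name -> (up, total); returns (class, pct, met) rows."""
    rows = []
    for cls, (up, total) in samples.items():
        pct = round(uptime_pct(up, total), 2)
        rows.append((cls, pct, pct >= TARGETS[cls]))
    return rows

rows = report({"critical": (99810, 100000),
               "core":     (99660, 100000),
               "standard": (99180, 100000)})
# -> [('critical', 99.81, True), ('core', 99.66, True), ('standard', 99.18, True)]
```

The overall figure is computed the same way against its own 98.59% target.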
New Karma:
==========

Andrew Bayer was added to root@

Finances:
==========

$758.48   Windows license
$761.27   Replacement hard drives
$1607.78  Service contracts
$1247.95  AWS

Operations Action Items:
========================

n/a

Short Term Priorities:
======================

Codesigning:
------------

A number of projects inquired about codesigning at ApacheCon, with intent to sign up and request access to codesigning. In the last 30 days no releases have been signed.

Machine deprecation:
--------------------

The failure of the machine underlying the SVN master has caused some reprioritization of work and an evaluation of our older physical machines and the important services that run there. There is ongoing work to relocate the writable git service, as well as the Bugzilla machines, off of the rapidly aging hardware.

Long Range Priorities:
======================

Monitoring
----------

Work continues on monitoring, with several advances being made. The first is collectd, which is being deployed to all of our puppetized machines; this gives us insight into the performance metrics of each machine. Additionally, we've managed to get Dell's OMSA platform for monitoring the underlying hardware deployed to a number of physical hosts. This information is being used to monitor for things like failed disks and failed power supplies, and to see other overall health information like ambient and internal temperatures.

We've built a platform that integrates PagerDuty, HipChat, and email alerts for our Dell physical hardware that has OMSA installed. You can see the code here:

https://github.com/humbedooh/dsnmp

We are also expanding the audience for hardware notifications from root@ to infrastructure-private, all while reducing that traffic down to a single email per day.

Automation
----------

We continue to make large steps forward in our automation efforts. The following services are now in configuration management. Several others are currently in progress.
  rsync.apache.org
  Git mirrors
  Subversion master
  status.a.o website

Additionally, some of our work from last month around getting the webserver into configuration management has been adopted upstream:

https://github.com/puppetlabs/puppetlabs-apache/pull/939

Resilience
----------

We now have a second critical service that we can easily replicate and redeploy.

Technical Debt
--------------

We keep encountering touchpoints that are tied to specific machine names and specific file locations on specific machines, and a large chunk of our time is spent decoupling them.

General Activity:
=================

This month has been extraordinarily taxing on Infrastructure, with three large outages or service degradations occurring following ApacheCon.

Several volunteers and contractors traveled to Budapest to attend ApacheCon EU. In addition to a dedicated track around Infrastructure, an Infra hackfest table was manned where folks could come and ask questions or get help. Many folks took advantage of this.

While at ApacheCon EU, a number of volunteers and contractors met with folks from OpenOffice to talk about existing and future needs. Details of this meeting can be seen in Andrea Pescetti's notes:

http://s.apache.org/aooinframeetingreport

Git mirroring (including git mirroring to GitHub) suffered some disruption immediately following ApacheCon EU. The 6+ year old machine that the service was running on was shared with 3 project zones, and as the number of git mirrors has increased over time, service delays grew to the point that the service was non-functional. Initial attempts to restore service on the machine were not successful over the long term, and we ended up moving the service off of the affected host and into AWS.

The underlying machine that the service was running on has been deprecated. We have sent the projects with zones there notice that we plan to shut down the machine in the not too distant future.
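The monitoring section above mentions widening hardware notifications from root@ to infrastructure-private while collapsing them to a single email per day. A minimal sketch of that kind of aggregation, with all names and messages invented for illustration:

```python
# Illustrative sketch (not the actual dsnmp implementation) of turning a
# stream of hardware alerts into one daily digest email. Host names and
# messages below are invented stand-ins.
from collections import defaultdict
from datetime import date

class AlertDigest:
    def __init__(self):
        self.alerts = defaultdict(list)  # host -> list of alert messages

    def record(self, host, message):
        """Collect an alert instead of mailing it immediately."""
        self.alerts[host].append(message)

    def render(self, day):
        """Produce the body of the single daily email."""
        lines = ["Hardware alert digest for %s" % day.isoformat()]
        for host in sorted(self.alerts):
            for msg in self.alerts[host]:
                lines.append("  %s: %s" % (host, msg))
        return "\n".join(lines)

digest = AlertDigest()
digest.record("eirene", "PSU 2 reports degraded state")
digest.record("erebus", "DIMM B4 correctable errors")
body = digest.render(date(2015, 1, 15))
```

A cron job could call `render()` once a day and mail the result, keeping the alert volume at one message regardless of how many events fired.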
We have begun organizing replacement resources to take the place of the Solaris zones.

The machine that ran the Foundation's SVN master suffered an outage caused by a failure of the root filesystem array. The initial public report is at:

https://blogs.apache.org/infra/entry/subversion_master_undergoing_emergency_maintenance

The post-mortem report from this event is at:

https://blogs.apache.org/infra/entry/svn_service_outage_postmortem

We were able to resurrect the service on new hardware, with the configuration residing completely in configuration management. This should minimize the time to recovery for future issues. The recovery took approximately 2 days to complete.

We suffered a loss of all network connectivity to services in our colo facility at Oregon State University Open Source Lab on 10 December. The outage lasted almost 2 hours. We are still working with OSUOSL, OSU, and NERO to figure out what happened. A redundant (but disabled) network link was activated to bring us back online.

While monitoring the newly provisioned webserver, we discovered that Cassandra is pointing users to a .deb package repository on the main webservers instead of utilizing the mirrors, as package repositories won't function with our current mirror offering. After some analysis we found that this package repository was the source of 15% of all traffic hitting our webservers. Our initial thought was to block that traffic, but doing so would have had a large impact on folks. We are currently researching options for providing package repositories so as to remove that load from our main webservers.

Uptime Statistics:
==================

Unfortunately, uptime for critical services this month saw a sharp decline due to the subversion outage. Also affecting uptime were the brief network outage on December 10/11 and the migration of the git mirrors to a new location.
Overall, because all other services performed in exemplary fashion, we experienced a slight increase in overall uptime of 0.03% compared to previous months. The total recorded uptime for 2014 (weeks 27 through 50) is as follows:

Type:                Target:    Reality:    Target Met:
---------------------------------------------------------
Critical services:   99.50%     99.84%      Yes
Core services:       99.00%     99.77%      Yes
Standard services:   95.00%     98.38%      Yes
---------------------------------------------------------
Overall:             98.00%     99.39%      Yes
---------------------------------------------------------

For details on each service as well as average response times, see http://s.apache.org/uptime

Contractor Details:
===================

Geoffrey Corey:
- Resolved 37 JIRA tickets
- Worked on completing the steps to graduate 4 projects to TLPs
- First round of on-call duties
- Renamed the argus podling to ranger
- Coordinated with OSUOSL to replace a disk in Tethys
- Coordinated with Henk P. to migrate the rsync.apache.org US host off eos and onto the AWS TLP server
- Cleaned TLP data off eos to help recover disk space
- Cleaned up lingering details with TLP/dist and rsync being migrated off eos
- Helped in various ways to restore the svn master after the failure of eris

Daniel Gruno:
- Helped to restore the subversion master after the failure occurred
- Worked on implementing more extensive SNMP monitoring of capable (mostly puppetized) hosts
- Moved git.apache.org off the aging Solaris box and onto a new puppetized VM
- Tweaked the svn/git-to-github syncing process to cut execution time from 2 hours to 8 minutes
- Helped clean up some TLP snafus related to bringing svn templates to git.
- Various on-call duties
- Resolved 35 JIRA tickets since last report
- Fixed various rendering issues with status.apache.org by using asynchronous calls

Chris Lambertus:
- Resolved 9 Jira tickets
- Helped to restore the subversion master after the failure occurred
- On-call duties
- Fixed the FUB VPN and rebuilt oceanus as a CloudStack testbed
- Resolved ongoing issues with secmail.py
- Resolved a major outage due to the eirene vmware network failure
- Resolved nightly backup problems with several hosts
- Ongoing prototyping and evaluation of "enterprise" backup solutions
- Troubleshooting assistance with TLP/git/svn puppet migrations and service configs
- Documented the nexus project creation process
- Ongoing addition of hardware to Dell OME; acquired a Windows license to make this a complete service

Tony Stevenson:
- Worked on the restoration of the SVN master, including moving it to Ubuntu and puppet
- Started work on moving the git-wip-us service off of the aging baldr onto an Ubuntu VM
- Various on-call activities
- Attended ApacheCon, where we had several infra sessions and an Infra meetup on the Sunday following the conference
- Worked on several JIRA issues and some longer-term background activities like host patching
- Conducted the SVN master post-mortem exercise
- Wrote and updated puppet modules to continue making them platform-agnostic

Gavin McDonald:
- Worked on 75 Jira tickets, closing 57
- Infra commits SVN - 33
- On-call duties
- Worked with new contractors on various issues
- Updated non-puppet VMs/machines for security updates
- More work on reducing cron mails
- Started work on archive logging for HipChat
- Restored the SVN-to-Buildbot hooks mechanism after the move
New Karma:
==========

Joe Schaefer has resigned from the Infrastructure Committee (and root@)

Finances:
==========

$250 for AWS

Operations Action Items:
========================

none

Short Term Priorities:
======================

* Wiki outages - As highlighted in the October Infra report, we have been running into issues with the MoinMoin wiki. This degradation is caused by severe disk IO load on the machine that hosts this service. This is complicated by the fact that this same machine also hosts the US web mirror for the foundation and projects, as well as the mail-archives service. Additional fallout has been that publishing website updates incurs tremendous delays for websites on the US mirror.

  We think that much of the sudden IO load increase is due to the machine's ZFS filesystem growing to ~90% capacity. Because of the copy-on-write nature of ZFS, and the allocation strategy switch that happens when a volume begins approaching capacity, performance severely degrades.

  We evaluated all of the services on the host, and our initial analysis was that we'd be best served by separating out the web sites for projects and the foundation. During the course of doing that, we discovered that a large number of projects were distributing artifacts and publishing their websites using a long-deprecated (February 2012) method. This, among other factors, complicated the process, but today I am happy to report that we now have an easily replicable webserver definition in configuration management that allows us to deploy any number of webserver hosts in short order. In the process we paid off a large amount of technical debt, and the newer members of our team now understand the entire website process from checkin to publication.

  This has dramatically improved the wiki situation, though it has revealed some underlying issues with the wiki that will need to be dealt with.
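A capacity check of the kind that would have flagged the ~90%-full ZFS pool described above can be sketched as follows; the thresholds and the example pool size are illustrative assumptions, not our actual alerting configuration:

```python
# Sketch of a fill-percentage check for a storage pool. Thresholds are
# illustrative assumptions; copy-on-write filesystems such as ZFS degrade
# well before 100%, so the alert fires early.

WARN_PCT = 80.0   # start warning here
CRIT_PCT = 90.0   # ZFS allocation behaviour changes around this point

def pool_status(used_bytes, total_bytes):
    """Classify a pool as ok / warning / critical by fill percentage."""
    pct = 100.0 * used_bytes / total_bytes
    if pct >= CRIT_PCT:
        return "critical", pct
    if pct >= WARN_PCT:
        return "warning", pct
    return "ok", pct

status, pct = pool_status(9 * 10**11, 10**12)   # 900 GB used of 1 TB
# -> ('critical', 90.0)
```

In practice the used/total figures would come from `zpool list` or similar, and a "critical" result would page the on-call contractor before the performance cliff is reached.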
Long Range Priorities:
======================

Monitoring
----------

While we continue to work on monitoring, and have far to go, this month yielded two major improvements: the first in disk capacity alerting, and the other in failed array management. This is not yet pervasive in our infrastructure, but is a start towards that end.

Automation
----------

This month saw a large step forward, as a number of services can now be deployed in an automated fashion and are in configuration management. Many of these had been in progress for a month or longer. They include the following services:

 * committers mail-relay
 * all project and Foundation websites
 * rsyncd.apache.org (mirror distribution)
 * host provisioning dashboard
 * inbound email MXes

Resilience
----------

The work done around automation has given us our first critical services that we can easily replicate and deploy multiple instances of. To give you an idea of scale: we can deploy an external email exchanger, completely configured, in less than 10 minutes, or all of the project websites in about 3 hours (largely bound by having to download all of the site content).

Technical Debt
--------------

This month saw us paying back large portions of technical debt. Of particular interest is the shuttering of legacy means of publishing releases and websites that were deprecated almost 3 years ago. We were also able to decouple a large number of very tightly bound services.

General Activity:
=================

It is worth noting that the new monitoring systems we have in place keep an eye on the status of internal disks and disk arrays. On 30 October we were notified in our HipChat room that the machine that serves as our svn master had a bad disk in its array. One contractor went to the datacenter to replace the disk with a spare from our inventory, whilst another contractor configured and onlined the disk, re-adding it to the pool.
From hardware-failure notification to disk replacement and back online, all in the same day.

Code Signing
------------

Two more releases were signed this month, both from Tomcat. Three additional projects (Logging (Chainsaw), OpenOffice, and OpenMeetings) are now set up and enabled to sign artifacts, though most are still testing.

Uptime Statistics:
==================

Overall, uptime has seen an increase of 0.30% compared to last month, putting uptime for the October-November period at a record high of 99.82% overall. The total recorded uptime stats since we started measuring (weeks 27 through 46) are as follows:

Type:                Target:    Reality:    Target Met:
---------------------------------------------------------
Critical services:   99.50%     99.97%      Yes
Core services:       99.00%     99.77%      Yes
Standard services:   95.00%     98.14%      Yes
---------------------------------------------------------
Overall:             98.00%     99.36%      Yes
---------------------------------------------------------

For details on each service as well as average response times, see http://s.apache.org/uptime

Contractor Details:
===================

Daniel Gruno:
Non-JIRA related issues worked on:
- Explored and implemented a rewrite of our DNS system
- Miscellaneous help/guidance for new staffers
- Collated www+tlp server stats for an overview of our traffic/request rates
- Worked on setting up Chaos as a disk array for Phanes (for unified logging)
- On-call duties
- Assisted Geoff in moving rsync'ed data to svn for web sites
- Helped tweak the httpd instance on tlp-us-east to cope with the request load
- Tweaked status.apache.org, added a hard-coded notice about wiki.a.o
- Fixed dependency issues with Whimsy
- Deprecated SSLv3 on all SSL terminators in response to POODLE

Geoffrey Corey:
Non-JIRA related issues worked on:
- Finished migrating all tlp sites into puppet and the new tlp host
- Migrated lingering projects using rsync for artifact distribution to using svnpubsub for distribution
- Cleaned up retired sites with correct redirects for
  www.a.o/dist to their attic pages
- Decommissioned/surplused the old hermes hardware
- Replaced a disk in eris

JIRA related tasks:
- Resolved 11 JIRA tickets
- Renamed incubator project optiq to calcite

Gavin McDonald:
- Worked on 52 Jira tickets, closing 27
- Infra commits SVN - 40
- On-call duties
- Worked with new contractors on various issues
- Resolved queries from IRC, HipChat and email (no jira tickets)
- Updated more Ubuntu machines/VMs for the Bash vulnerability
- Worked more on upgrading pkgng FreeBSD machines
- Continued work on improving the pass rate of builds, liaising with projects as necessary
- Configured a new disk into the eris array
- Adding packages to all Jenkins slaves via ansible is now working fine. Work has started on doing the same for Buildbot slaves.

Chris Lambertus:
- Closed 9 jira tickets
- First on-call
- Created a new VM for the status.a.o migration to the cloud (RAX)
- Began work on evaluating backup and disaster recovery processes
- Noted and resolved problems with zfs on abi causing failed backups
- Upgraded abi to FreeBSD 10.0
- Purged extraneous zfs snapshots
- Analysis and evaluation of tools for improved backups
- Implemented the collectd puppet module (monitoring)
- Added circonus monitors for new tlp hosts (monitoring)
- Began work on the oceanus cloudstack eval
- secmail.py troubleshooting and repair
- MX incubator list troubleshooting with Tony
- metis disk replacement PERC troubleshooting with Tony

Tony Stevenson:
Working on several major priorities:

- eos - The main US webserver has slowly grown its disk usage over time, most recently growing over the threshold at which ZFS suffers significant performance penalties. Disk capacity can't easily be increased, and eos is scheduled for EOL, so the short-term goal was to tidy up the data on disk. This was achieved by moving some of the older static data to eris. See work by others on the overall status of retiring eos, and see below for commentary on a wiki migration PoC.
- abi - The host at Traci.net (FL) that we have been using for an offsite copy of data for a number of years had suddenly become increasingly unusable, and jobs were failing. This was primarily caused by a failure in removing old data snapshots, which essentially stemmed from the period when hermes had to be rebuilt. A lot of triaging of old copies of data had to be done; this was done in conjunction with others, notably cml@.

- hermes - Fixed a long-standing issue with a faulty disk on the dungeon master that hosts hermes. Also worked on better managing the incoming mail queue, as on occasion it backlogs and has a compound effect on genuine mail delivery.

- chaos (host where ELK is to be part-deployed) - This work was delayed until ApacheCon EU given the more urgent issues on eos and abi.

- MX - After the unsuccessful attempt to migrate the MXes to new hosts run from AWS EC2, several lessons have been learnt, and we have fixed all but one of the underlying issues at the time of writing this report. The last fix is more complicated and needs significant testing before sign-off, and then we need to expose this change to the affected mailing lists so that they are kept in the loop, though we are aiming for a completely transparent cutover when we come to implement it.

- As part of the bigger piece of work to unpick all the services on (and dependent upon) eos, I have started work on setting up a PoC that will host the moinmoin wiki service (wiki.apache.org) in AWS EC2. This is making good progress, and a data synchronisation should begin during ApacheCon EU, allowing the team to see it working during the F2F on the Saturday after the conference.

- More puppet work creating new modules and adding 3rd-party modules, further extending the puppet-managed aspects of our machines.

- secretary@ workbench issues - This was seemingly related to a corrupt mbox file on minotaur. Moving it aside and having clr@ manually process the period allowed us to re-enable the automatic service, albeit at a much slower frequency than before.
New Karma:
==========

none

Finances:
==========

Domains: $85

Operations Action Items:
========================

None

Short Term Priorities:
======================

Code Signing
------------

The code signing service is now live. Two projects have successfully shipped signed artifacts. Two more projects are in various stages of signing up, being vetted, or testing out the service.

Long Range Priorities:
======================

Monitoring
----------

Work has been ongoing to provide monitoring of the underlying hardware many of our services depend on; this is still in the early exploratory stages and remains ongoing. Work building on our existing base of service status monitoring to provide insight is also ongoing.

Automation
----------

Much progress has been made around automation; multiple services are now fully defined in configuration management, though we still have far to go.

Resilience
----------

Exploratory work to evaluate our ability to recover from disaster is underway, but is still early.

Technical Debt
--------------

General Activity:
=================

The primary US web server, which provides foundation and project websites in addition to mail-search and the MoinMoin wiki, has been problematic this week. Specifically, we're running into multiple problems occurring at once, resulting in tremendous IO overhead and leading to very slow responses or outright failures.

This has been a very busy month from a security perspective. Our Bugzilla instances went through multiple rounds of patches in response to 5 related security issues. Additionally, we've spent much time responding to Shellshock.
Uptime Statistics:
==================

Detailed Contractor Reporting
=============================

* Geoffrey Corey
  - Got acquainted with the ASF's system and services layout
  - Resolved 7 JIRA tickets
  - Cleaned out the ASF's server racks and old hardware/spare parts at OSU
  - Inventoried spare hardware at OSU
  - Learned how to do TLP requests
  - Learned how to do svn-to-git migrations
  - Set up AOO's Mac mini build slave
  - Learned the ASF's puppet infrastructure
  - Deployed the NTP puppet module
  - Learned how svnpubsub and svnwcsub are set up and used; created a puppet module for them
  - Learned how to use AWS to begin migrating TLPs off eos to fix IO issues

* Gavin McDonald
  - Worked on 44 Jira tickets, closing 30
  - Infra commits SVN - 59
  - On-call duties
  - Worked with coreyg, liaising on hardware to be removed
  - Installed 2 new machines at OSU; DRACs configured, ready for OS installs
  - Resolved queries from IRC, HipChat and email (no jira tickets)
  - Worked through more cron noise
  - Worked with Intervision and organised warranty renewals/declines
  - Updated many Ubuntu machines/VMs for the Bash vulnerability
  - Worked on, and am continuing to work on, upgrading the FreeBSD machines. There is much work involved, as we are breaking free of Tinderbox-based updates and going direct to the official repositories. At the same time, packages/ports are being updated (forcibly) to the new pkg system. The machines done so far are showing no ill signs as of yet, but we haven't forced an upgrade of all packages, just a few essential ones. I expect that as more machines are done, we'll start to see some things break. This is an unavoidable one-way trip that we'll deal with.
  - Started looking into automating jira project key renaming, in support of dealing with projects that rename themselves. The Jira CLI plugin looks promising, but doesn't seem to support renaming yet (though it offers project cloning and deleting). Investigating the API directly, but that too seems to lack support thus far.
  - Worked on improving Buildbot slave stability.
  - Worked on improving the pass rate of builds, liaising with projects as necessary.
  - Restored RAT reports for projects and the RAT master summary pages.
  - In our cwiki, added the ability for projects to access intermediate HTML for diagnosing formatting issues in PDF exports. (https://cwiki.apache.org/intermediates/)
  - GitHub-to-Buildbot-to-HipChat integration; testing github commits to the infra repo

* Chris Lambertus

  Tickets:
  * Closed N (I don't know what N is) tickets.

  Outages:
  * Troubleshooting and resolution for the vmware host outage.
  * Ongoing: www.a.o/mail-search/moin-moin troubleshooting

  Puppet/Automation:
  * Learning puppet
  * Implemented dovecot and SNMP modules

  Monitoring:
  * Researched ways to monitor existing hardware that runs FreeBSD.
  * Began needs analysis and some PoC work around monitoring with Circonus, collectd, SNMP, etc.

  Disaster Recovery:
  * Initial research into the current state and how we might improve our disaster readiness.
New Karma:
==========

Chris Lambertus (cml)
Geoffrey Corey (coreyg)

Finances:
==========

RAM for VMware hosts  $715
Replacement HDDs      ~$1700
Mac OSX build slave   $730
Puppet training       $1300
Domains:              $17

Operations Action Items:
========================

Short Term Priorities:
======================

## Code signing

Mark Thomas successfully concluded his testing, and we were able to come to an agreement with Symantec. The service has thus far been deployed with Mark Thomas leading efforts to deliver signed code for Apache Commons. The Apache Commons PMC is currently voting on release artifacts for the first signed binaries. After completion of this test, the service will be available to any PMC requesting it.

## Build/CI environment

http://s.apache.org/hDu

### Yahoo-provided machines

Yahoo has graciously increased the number of machines that they provide (and provide colo services for) to a total of 20 machines this year. This has tremendously reduced the pending queue size for our build services.

### Cloud slaves

Our RAX cloud environment is now being utilized by Jenkins to deploy (and destroy) machines on demand in response to load. Additionally, we've made a RAX account available to the Gora PMC for the twice-yearly testing they plan to engage in.

Long Range Priorities:
======================

* Monitoring

We are beginning to get insightful information out of monitoring. We now have a mail loop that provides information on the cycle time from sending to mail reception. Additionally, we have started monitoring some elements of host storage. Centralized logging is making slow progress, but has a plan with a timetable.

* Automation

The base-level framework for machine automation is complete, and that work is expanding. As we need to break services out, we are building them with puppet. Additionally, work to programmatically produce JEOS machines for bare metal as well as virtualization and cloud targets is progressing nicely, with most of that work expected to be wrapped up by end of month.
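The "Cloud slaves" item above describes Jenkins deploying and destroying RAX machines on demand in response to load. A minimal sketch of that kind of queue-driven scaling decision; all names, caps and thresholds here are invented for illustration and are not the actual Jenkins/RAX integration:

```python
# Illustrative sketch of a scale-up/scale-down decision for on-demand
# cloud build slaves, driven by pending-queue depth. The cap and the
# jobs-per-slave ratio are invented assumptions.

MAX_SLAVES = 10       # hypothetical account quota
JOBS_PER_SLAVE = 2    # hypothetical executor count per slave

def desired_slaves(pending_jobs):
    """How many slaves we want for this queue depth, never exceeding the cap."""
    needed = -(-pending_jobs // JOBS_PER_SLAVE)  # ceiling division
    return min(MAX_SLAVES, max(needed, 0))

def scaling_action(pending_jobs, current_slaves):
    """Return ('spawn'|'destroy'|'noop', count) to reconcile toward the target."""
    want = desired_slaves(pending_jobs)
    if want > current_slaves:
        return ("spawn", want - current_slaves)
    if want < current_slaves:
        return ("destroy", current_slaves - want)
    return ("noop", 0)
```

For example, 7 pending jobs with 2 slaves running would yield `("spawn", 2)`, while an empty queue tears the fleet back down; idle slaves cost money, which is the point of the on-demand model.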
* Technical Debt/Resiliency

Some work has happened identifying error conditions that have long been generating complaints in a number of processes, and resolving them; this is currently focused on errors around backup scripts.

General Activity:
=================

* Welcomed two new contractors, Chris Lambertus and Geoff Corey, to the fold.

* We've dealt with an unusually high number of failed-hardware issues this month.

* Sourceforge has reached out to infra regarding migrating Apache Extras to Sourceforge.

* The machine that houses our US web server (for www.a.o and $tlp.a.o) as well as mail-search and the MoinMoin wiki has experienced tremendous IO load. Work is ongoing to break out those services and reduce the total IO load on any given machine. This has been noticeable to end users in the form of wiki slowness and slow updates to project websites on the US mirror.

* repository.apache.org suffered a severe service degradation that resulted in many projects being unable to publish artifacts to Nexus for several days. For details see: http://s.apache.org/H2f

* We've found a number of processes that infra executes that appear to be tied to being listed as a member in LDAP. We're working to resolve that issue and tracking it in INFRA-8336.

Uptime Statistics:
==================

Targets remain the same as in the last report (99.50% for critical, 99.00% for core and 95% for standard, respectively). These figures span the previous reporting cycles as well as the present reporting cycle (weeks 34-38). Overall, the figures have gone up since the last report, and we are continuing to meet the uptime targets.
Type:                Target:    Reality:    Target Met:
---------------------------------------------------------
Critical services:   99.50%     99.97%      Yes
Core services:       99.00%     99.77%      Yes
Standard services:   95.00%     97.46%      Yes
---------------------------------------------------------
Overall:             98.00%     99.16%      Yes
---------------------------------------------------------

For details on each service as well as average response times, see http://s.apache.org/uptime

Detailed Contractor Reporting
=============================

* Daniel Gruno:

Work done since past report:
- Cleared 30 JIRA tickets. See those for additional details.
- Helped introduce Chris to his new job. This included setting up his account, putting it into the correct staff groups, assigning some easy JIRA tickets to get started with, and walking him through the process of resolving those tickets.
- Fixed some mailing lists mistakenly marked as private. This seems to be a recurring problem, so we will need to tighten our mlreq page and make it harder to create a private list.
- Created new mailing lists for Reef.
- On-call duties.
- Started work on the Infrastructure presentation for ApacheCon EU.
- Discussed doing a "Git at the ASF" talk with David at ApacheCon, as we have a free slot.
- Dealt with Freenode's security breach (mainly rerouting some IRC services to the EU and resetting passwords).
- Started work on resolving the current issues faced by non-member staffers. This will likely take some time to finish and involve several people. Our first priority should be getting a new ACL set up for browsing the mail archives, so root has access to this data. This is a sensitive operation, but one that should be well covered by the confidentiality clause in staffers' contracts.
- The ELK stack is progressing; storage setup is expected to be done this week, at which point we will be able to start pointing some of the heavier services to it.
- Answered queries from Joe Brockmeier re Hadoop moving to Git and the new status page.
- Helped the EVP and fundraising with the new Bitcoin donation methods (and answered queries on that).
- Added commit comment integration with GitHub. This is still a work in progress, and I plan to rewrite the entire integration system when time permits.
- Moved some VMs around in response to prolonged downtime on Erebus due to disk replacements. This resulted in minimal downtime for services (a few seconds at most).
- Finished work on the subscription service for our monitoring of local project VMs.

* Gavin McDonald:

Work done since last report:
- 68 Jira tickets closed.
- 7 of the Jira tickets closed were hardware-related repairs - disks, PSUs and memory. The hardware situation is much better; still a couple to resolve.
- More Jenkins work done. The Ansible issues have been determined, but the slaves are unreliable at present. We still have no OOB access to them, and 9 times out of 10, if a reboot is needed, the slave doesn't come back. This situation is only tolerable for a certain period, and that is nearly up. David is in talks to get more slaves available.
- Buildbot has been worked on some more; it got left behind due to other work but is now getting some love once more. There is one major nag in that some slaves (seemingly only the new ones) are failing randomly with XML corruption failures even though the checkout performs fine. Testing shows that the XML isn't being returned (but only some of the time).
- There are plans to upgrade the Buildbot master this month.
- There are plans to upgrade Confluence (across several versions) this month.
- On-call duties
- Working through reducing cron mails
New Karma:
==========

none

Finances:
==========

RAM for VMWare host: $690.79

Operations Action Items:
========================

Short Term Priorities:
======================

* Code signing

Mark Thomas has successfully concluded his testing of the Symantec application signing service. Subsequent to that, he has identified a workflow that should work for our many projects. Conversations with Symantec on pricing are ongoing.

* Response timeframe targets:

It was agreed to set up three distinct timeframes for responding to incidents:

1) For critical services, incidents should be responded to within 4 hours
2) For core services, incidents should be responded to within 6 hours
3) For standard services, incidents should be responded to within 12 hours

The response need not be a resolution of the issue, but needs to include one or more of the following steps:

1) Acknowledging the incident through internal channels (see PagerDuty et al)
2) Communicating the incident to the involved/affected people in accordance with the new communications plan laid out by the VP
3) Delegating the issue to a member of infrastructure whenever possible
4) Tracking the incident (the method depends on the duration and gravity of the incident)

* On-call rotation:

At the f2f meeting in Cambridge, it was decided to introduce an on-call rotation between contractors. Each week, a contractor will be assigned as being on-call, and will be responsible for either resolving, delegating or communicating about outages; account, mailing list and TLP creations; planned changes; and security issues. To the extent that this is possible (not counting sleep), incidents must be responded to within the new target timeframes, as explained in the previous paragraph.

In the time since going live with both the on-call rotation and response-time tracking, responses have been dramatically faster than the service level expectations.
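The three response timeframes above lend themselves to a simple mechanical check. A sketch, assuming the tier names from the text; the incident timestamps below are invented for illustration:

```python
# Sketch of checking an incident response against the agreed timeframes
# (4h critical, 6h core, 12h standard). Tier names mirror the text; the
# example timestamps are invented.
from datetime import datetime, timedelta

RESPONSE_SLA = {
    "critical": timedelta(hours=4),
    "core":     timedelta(hours=6),
    "standard": timedelta(hours=12),
}

def met_sla(tier, opened, responded):
    """True if the first response landed within the tier's window."""
    return (responded - opened) <= RESPONSE_SLA[tier]

ok = met_sla("critical",
             datetime(2014, 8, 1, 9, 0),
             datetime(2014, 8, 1, 12, 30))   # responded 3.5 hours later
# -> True
```

Running such a check over PagerDuty acknowledgement timestamps is one way the "dramatically faster than the service level expectations" claim above could be measured.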
As we build up staff numbers and spread geographically a bit, the
expectations may change.

* Improved response for and analysis of Java services at the ASF

At the previously mentioned f2f meeting, contractors were given a
detailed course on analysing and reporting incidents with Java
applications run by infrastructure. We expect this new information to be
extremely valuable in reaching and maintaining the target uptime for Java
services. The staffers would like to extend a very big thank-you to
Rainer Jung for his services in this matter.

Long Range Priorities:
======================

* Monitoring

** Uptime monitoring and responsibility:

In addition to the previous board report, a new service level agreement
was made between infrastructure staffers, increasing the targets for
uptime on critical- and core-level services, as described below in the
statistics paragraph. Ensuring that services meet the new targets has
been made one of the cornerstones of infrastructure's work. The
monitoring of public-facing services has been outsourced to a third party
(free of charge), and we will be focusing on having Circonus produce
metrics for our inward-facing services/devices, such as LDAP, PubSubs,
SNMP etc.

** Unified logging:

Experiments with unified logging are proceeding as planned, with more and
more hosts being coupled into the new logging system. A filtering
mechanism for the Lucene-based backend has been created, allowing anyone
to use the logging service based on their LDAP credentials. As such,
anyone with access to a specific host (as defined in LDAP) will be able
to pull logs from the unified logging system. We are confident that this
will make debugging and analysis easier, to the point that we are
disabling older alerting/information systems and using the logging system
to fetch information that would previously have been sent via email to
root@.
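The access rule behind the unified-logging filter described above - a user may only pull logs for hosts their LDAP membership grants - can be sketched in a few lines. The in-memory dict below stands in for a real LDAP lookup, and all names (hosts, users, record fields) are illustrative:

```python
# Minimal sketch of LDAP-scoped log filtering, assuming host access is
# resolved from LDAP into a per-user set of hostnames beforehand.
def visible_logs(user, host_access, log_records):
    """Return only the log records from hosts the user can access."""
    allowed = host_access.get(user, set())
    return [rec for rec in log_records if rec["host"] in allowed]

# Stand-in for an LDAP query result: user -> accessible hosts.
host_access = {"alice": {"erebus", "arcas"}, "bob": {"arcas"}}
logs = [
    {"host": "erebus", "msg": "disk replaced"},
    {"host": "arcas",  "msg": "RAM upgraded"},
]
print([r["msg"] for r in visible_logs("bob", host_access, logs)])
```

The real implementation filters at the Lucene backend rather than in application code, but the access semantics are the same.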
* More virtualisation; better use of what resources we have

It was decided to move towards more use of virtualisation for many
critical and core services, including our main web sites and wikis. This
will allow us to better respond to incidents and resolve them without
affecting other services. Furthermore, it is our belief that we can free
up resources by switching to a virtualised environment, thereby possibly
getting more space for the crammed-up project/service VMs.

* Automation

** Cloud-based dynamic build slaves have been in progress for a bit. Much
of the work around this has been driven by Dan Norris. Building on a
framework of repeatable builds, he and Gavin McDonald have been
successfully spinning up on-demand build slaves with our RackSpace
account. This also relates to our goals around configuration management,
and the configuration of the machine is in Puppet. Expect to see this
service go into production in the next week or so.

** Puppet - the scope of the Puppet deployment continues to edge forward.
Gavin McDonald attended training just before the Infra F2F meeting.

* Technical Debt and Resiliency

Work around uptime monitoring - and actually being able to better
understand where the pain points are - coupled with some of the knowledge
we gained at the Infra F2F, has allowed us to focus on long-term
adjustments rather than hasty short-term restoration of service. You
should see this reflected in the uptime statistics.

General Activity:
=================

- A face-to-face meeting between infrastructure members was held in
  Cambridge, UK.
- 27 new committer accounts created, 8 new mailing lists (TBC)
- 3 projects were promoted to TLP
- 193 JIRA tickets resolved (since last report)
- A new status site was launched by Infra at status.apache.org
- PagerDuty has donated a gratis account for up to 10 users to ASF Infra

Uptime Statistics:
==================

Due to a new SLA between contractors, the targets for critical and core
services have been updated to reflect the new criteria (99.50% for
critical and 99.00% for core, respectively). This represents an overall
increase of 0.57% uptime across the board. These figures span the
previous reporting cycle as well as the present reporting cycle (weeks
29-33).

Type:               Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:  99.50%    99.94%     Yes
Core services:      99.00%    99.81%     Yes
Standard services:  95.00%    96.83%     Yes
---------------------------------------------------------
Overall:            98.00%    98.99%     Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime

Detailed Contractor Reporting
=============================

* Daniel Gruno

- Resolved 51 JIRA tickets
- Worked on a new status site for public ASF services
- Worked on the ELK stack, set up on phanes/chaos
- On-call duties
- Worked with Gavin and Tony on OpenSSL CVEs and general VM upgrades
- Fixed issues with svn2gitupdate not working
- Worked around a GitHub API change that had invalidated our integration
  measures
- Miscellaneous upgrades to ASFBot
- Continued work with uptime monitoring and reporting
- Worked with Fundraising to produce statistics about the ASF
- 8 days of vacation

* Tony Stevenson

- Resolved 74 issues
- Took part in the bugbash
- On-call rotation
- FreeBSD/Ubuntu SSL CVEs
- Started work to disable swap across all VMs
- Investigations into BigIP/F5 etc.
- Fixed issues with the VPN appliance
- Several disk replacements
- Ran down several repeat cron error messages
- Some further conversations with others about Puppet
- Set up a trial of LastPass, with a view to possibly replacing our GPG
  files
- Instigated the trial of HipChat, with a view to seeing if we could
  deprecate IRC

* Tony Stevenson - Comments

HipChat: For a long time I have been thinking about trying to find a
better way to engage with some of our users. I was also hoping to find a
way to get a better feed of information that was more relevant and
pertinent. We have hooked it up to JIRA, GitHub, PagerDuty, and PingMyBox.
These all provide near-realtime information that we can act on. A more
modern service may be seen as a move on from some of our older roots,
which might appeal to others. With the move we have also seen the SNR
improve significantly, enabling better communication across the team.

HipChat's alerting allows people to be notified of communications that
involve them, via push messages to a phone/tablet etc. Also, once you
return online from an offline state you see all the history, and the
history is searchable. You can join us here:
https://www.hipchat.com/gw4Cfp7JY

Private channels can also be created for those who need to control access
to a channel. Think #asfmembers etc.

* Dan Norris

- Built machine image automation using Packer
- Packaged (using FPM) many of the unpackaged build tools (provides a
  repeatable, known installation; allows us to query for status and
  version)
- Deployed a DEB repository in RAX CloudFiles for packages
- Puppetized the build slave configuration
- Documented the process of building a machine image and uploading it to
  RAX
- Using the jclouds plugin for Jenkins, successfully provisioned dynamic
  build slaves
New Karma:
==========

Dan Norris (dnorris)

Finances:
=========

* 64GB RAM for Arcas: $777.16

Operations Action Items:
========================

Short Term Priorities:
======================

* Signed binaries

Mark Thomas has made significant progress in his efforts with Symantec
around using their Binary Signing as a Service product. I have high hopes
that we are near a proposed solution.

* builds.a.o

Much work has been done around builds.a.o, and it has largely stabilized.
The past month has yielded 99.92% uptime. That's a far cry from the
routine outages that were happening, on average, once a day.

Long Range Priorities:
======================

* Monitoring

** Uptime monitoring and reporting:

As an extension to defining core services and service uptime targets,
Daniel has begun compiling weekly reports of uptime for most of the
publicly facing services. These reports will in turn be compressed into
monthly reports for the board, as well as a yearly report detailing the
overall uptime reality vs. our set targets. Eventually, these reports
will also cover inward-facing services.

** Unified logging:

Discussion and exploration has begun on unifying logging on all VMs and
machines. The logging will be tied to Puppet and allow for easy access to
each host's logs from a centralized logging database, as well as allow
for cross-referencing data. Initial exploration into using Logstash with
ElasticSearch and Kibana has begun, and is expected to produce findings
for use in the next board report.

* Automation

Tony has expended effort and time in deploying a more updated,
platform-agnostic base for Puppet. Giridharan Kesavan and Gavin have been
experimenting with using Ansible for build slave automation.
Dan Norris has begun work on automating VM/cloud provisioning.

* Technical debt

Gavin began addressing cruft in many of our automated jobs; this will be
a long-term effort, but the work is underway and already yielding
benefits. In some ways, we are just beginning to collect information to
let us know where we stand, and exactly how much debt we have accrued.
The uptime reports, and comparison of those against our first pass at
service level expectations, have started.

* Resiliency

Our efforts around resiliency are still nascent. We have begun to address
a few issues caused by resource constraints, though this is a very minor
attempt to provide true resilience. As other efforts in our long-term
priorities take shape, I expect that we'll begin to see this accelerate.

General Activity:
=================

* Dealt with yet another batch of OpenSSL CVEs affecting all hosts.
* Upgraded Arcas (JIRA host) with 64GB RAM to deal with slow response
  times. This has greatly reduced the response time from Jira. See the
  screenshot detailing that change:
  http://people.apache.org/~ke4qqq/ss.png
* Welcomed a new contractor, Dan Norris, to the fold.
* Face-to-face meeting in Cambridge between infrastructure people.
* Created 23 new committer accounts, 4 new mailing lists.

Uptime Statistics:
==================

These figures currently span weeks 27 and 28 of this year, and only cover
public-facing services.

Type:               Target:   Reality:   Target Met:
---------------------------------------------------------
Critical services:  99.00%    99.98%     Yes
Core services:      98.00%    99.84%     Yes
Standard services:  95.00%    92.71%     No
---------------------------------------------------------
Overall:            97.43%    97.80%     Yes
---------------------------------------------------------

For details on each service as well as average response times, see
http://s.apache.org/uptime

The target for standard services was not met due to our Sonar instance
being unstable at the moment, with only around 50% uptime.
We are investigating the issue.

Contractor detail:
==================

* Gavin McDonald

Short term jobs worked on this week:
====================================

Jira
----

Jira tickets worked on - see JQL query:
'project = INFRA AND updatedDate >= "2014/06/16" AND updatedDate <=
"2014/06/22" AND assignee was ipv6guru ORDER BY updated DESC'

Jira tickets closed - see JQL query:
'project = INFRA AND resolutiondate >= "2014/06/16" AND resolutiondate <=
"2014/06/22" AND assignee was ipv6guru ORDER BY updated DESC'

My open Jira tickets - see JQL query:
'project = INFRA AND status != Closed AND assignee was ipv6guru ORDER BY
updated DESC'

Commits made to the Infra repo: 22

June 16th saw planned downtime at OSUOSL. The downtime window was 2
hours, between 11am UTC and 1pm UTC. Both myself and Daniel Gruno covered
this outage window, as well as at least 2 hours before and after the
planned window. Actual downtime we saw was 2 minutes, at 11:55am.

Ongoing answering of queries on the infra@ and build@ mailing lists,
including quick resolutions to issues raised. The same goes for IRC -
channels open at the time of writing are: #sling #asfboard #jclouds
#asfmembers #asftac #abdera #avro #osuosl #buildbot #asftest #asfinfra

Worked on various buildslaves of both Buildbot and Jenkins - updating,
upgrading, and patching for SSL etc.

Worked on upgrading SSL for several other VMs, at the same time taking
the opportunity to update/upgrade/dist-upgrade and reboot.

Ongoing Medium Term Jobs:
=========================

Dell Warranty Renewals
----------------------

Involves liaising with various Dell reps via email. Service tags have now
(99%) been brought up to date and documented in the service-tags.txt
file. Make decisions on warranty renewals based on age and on whether it
is in our plan to renew the machine within the next 9 months. Get quotes
for, and give the go-ahead to Dell for, those we intend to renew.
The current email noise from Dell regarding these is quite high, so this
is a task I intend to complete over the next few weeks - to either renew,
or decline and stop the renewal emails.

Root Cron Job Emails
--------------------

Involves sifting through root@ cron emails from various machines and VMs,
and determining the important ones that can be assessed and fixed to
completion. Previously, this was just 'done' and perhaps followed up with
an email reply to the cron job in question. For better visibility and
reporting, I have now started creating Jira tickets for these tasks, and
have also given these tickets the 'Cron' label. I expect to make steady
progress and have the cron mails at least halved over the next 3 months.
See JQL query: 'project = Infrastructure and labels = Cron'

Confluence Wiki
---------------

Confluence needs an upgrade. A test instance is in progress. I hope to
have this done in the next couple of weeks.

Ongoing Longer Term Jobs:
=========================

Jenkins/Buildbot
----------------

Some time has been spent improving the stability of the Jenkins server
and its slaves. Thanks mainly to Andrew Bayer, the server has recently
improved dramatically. The slaves have seen improvement in stability and
uptime too, including the 2 Windows machines; I have spent a fair bit of
time recently on these.

I need to create new FreeBSD and Solaris slaves for Jenkins. The former I
think we can achieve in the cloud, whilst the latter I don't think is
supported at RackSpace - investigating. We might need to create our own
VM image for it. At the time of writing, 34 builds are in the Jenkins
queue, mostly attributable to these two missing slave OS flavours and
also to Hadoop jobs.

Buildbot stability is just about back to normal after I rebuilt the
master from scratch on a new OS, FreeBSD 10 (prepped by Tony). The forced
upgrade of the Buildbot master version itself also caused some
instability for a while, due to the configuration upgrades required.
This affected just about all projects using Buildbot and the CMS. I note
that the Subversion project has indicated that a mail should have been
sent to the Subversion PMC about the downtime suffered by the Subversion
project as a result of the code changes required by the forced upgrade.
Following this advice, I'd have had to email another 30+ PMCs telling
them the same thing. I find that my generic email to the infra list
should have been enough information for all parties concerned.

Cloud for Builds
----------------

Rackspace - A test machine has been created. Jenkins has yet to make use
of this, however, and I'm in the process of working out the best way to
integrate it with our systems - do we use LDAP, Puppet etc. with it, or
create a custom image we can replicate? I'll also be starting work soon
on an on-demand Buildbot test instance.

Microsoft Azure - A test machine with Windows Server 2012 is up and
running, and I have access. I am in the process of making changes to this
image to create a baseline, so that the Azure team can replicate several
more once I have it right. Once done for Jenkins I'll do the same for
Buildbot, and make sure to leave 2 or 3 instances available for general
project use, which I'll advertise as available once ready.

Puppet - Have completed the online pre-training Puppet course, as advised
by David, using a Vagrant instance via VirtualBox. I continue to invest a
couple of hours a week in looking through the Puppet Labs online
documentation. I continue to investigate the best methods of integrating
the Jenkins and Buildbot slaves with Puppet, though I'm really in a
holding pattern until our puppet master is upgraded to v3.

* Tony Stevenson

Took two weeks of vacation.

Spent a considerable amount of time trying to make a new Puppet 3 master
on a FreeBSD box. This did not pan out - there were far too many little
changes from a standard deployment needed, and we were still having SSL
issues with puppetdb.
A new Ubuntu VM has been built as the new puppet master and is now about
done. One more test to run tomorrow.

Spent a little bit of time on-boarding Jake into root@ activities (a/c
creation etc.).

Issues with the Erebus VMware host: needed a reinstall of the vSphere
agent and reconnecting to the management console.

New infra-puppet GitHub repo.

* Daniel Gruno

Work log for week 27:
=====================

- Created mailing lists for new and existing podlings
- Access to metis+eris for Jake
- Set up svnpubsub/CMS for new podlings
- Evaluated the ELK stack (ElasticSearch, Logstash + Kibana)
- Worked on factoid features for IRC
- Ordered 64GB RAM for Arcas (8x8GB, replacing 7x4GB)
- Upgraded hardware on Arcas (JIRA host)
- Set up dist areas (some requests proved invalid)
- Investigated and fixed database issues with ASF Blogs (twice)
- Monitored and compiled uptime records for core services over the last
  week

(The majority of my time was spent evaluating and tailoring the ELK
stack, as well as the math fun with semi-automating uptime reports.)

Work log for weeks 25 and 26 (sans JIRA tickets):
=================================================

- Updated ASFBot with some minor bugfixes and feature additions
- Worked on Git mirroring between ASF and GitHub (aka svn2gitupdate)
- Assisted in applying web server updates for projects
- Design discussions with Jan and Gavin about Circonus monitoring (still
  ongoing, awaiting results of an initial test)
- Discussed GitHub PR usage with the Usergrid project
- Investigated and solved an issue with JIRA not responding
- Worked on updating OpenSSL on all affected machines (CVE-2014-0224 et
  al; ~95% done, should be finished by the end of this week, ceteris
  paribus)
- Worked on an issue with nyx-ssl and puppet (still unresolved)
- Worked with Gavin to monitor and respond to OSUOSL network upgrades.
  Resolved.
- Helped projects tweak settings for IRC relaying of commits/JIRAs
- Worked on anti-spam measures for modules.apache.org (still under
  infra's umbrella)
- Worked with Dave to resolve the blogs 404 issue. Resolved in week 27 by
  Brett Porter.
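The "math fun" of semi-automating the uptime reports compiled in this period boils down to comparing per-tier measurements against targets. The sketch below uses the week-27/28 figures from the table earlier in this report; note that it computes a simple unweighted mean, whereas the published overall figure is presumably weighted by the number of services in each tier, so the two will not match exactly. All variable names are illustrative.

```python
# Back-of-the-envelope version of the per-tier uptime arithmetic used in
# these reports. Figures are the week-27/28 numbers reported above.
tiers = {
    # tier: (target %, measured %)
    "critical": (99.00, 99.98),
    "core":     (98.00, 99.84),
    "standard": (95.00, 92.71),
}

for name, (target, actual) in tiers.items():
    print("%-8s target met: %s" % (name, actual >= target))

# Unweighted mean across tiers; the published "Overall" row is weighted
# by service count, so this is only a rough cross-check.
overall_actual = sum(a for _, a in tiers.values()) / len(tiers)
print("overall: %.2f%%" % overall_actual)  # 97.51 (unweighted)
```

Even this crude check reproduces the table's conclusions: critical and core comfortably over target, standard dragged under by the unstable Sonar instance.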
New Karma:
==========

* Jake Farrell (jfarrell) was added to root
* Andrew Bayer (abayer) was added to infrastructure-interest

Finances:
=========

Discovered a past-due bill from Dell (based on Justin Erenkrantz getting
collection phone calls): ~$1300

Placed an order for 2 servers, totalling nearly $17,000

With help from the EA, arranged travel for an F2F as well as travel for
contractor training; thus far that has cost ~$9193

Operations Action Items:
========================

None at the moment

Short Term Priorities:
======================

* OSU hardware failures

Work continues on hardware failures at OSUOSL - replacement hardware has
been ordered and shipped, and work continues on getting it swapped in
while minimizing outages.

* Outage remediation

Much work continues on the action items drawn from the post-mortems.

* builds.a.o

We've received a good deal of help from the Jenkins community in finding
and dealing with issues.

Long Range Priorities:
======================

* Monitoring

Circonus has now replaced Nagios as our monitoring system, with lots of
help from Jan Iversen. While we still have a very long way to go, the
system is already proving useful, having alerted us to a number of
issues.

* Automation

Slow progress continues on rolling out configuration management, in an
effort to make our infrastructure better documented and more easily
reconstructed.

* Technical Debt

We have begun publishing/discussing early drafts of documents around
expected service levels as well as a communications plan; these are very
early steps in beginning to prioritize work around our technical debt.
See:
https://svn.apache.org/repos/infra/infrastructure/trunk/docs/services/LEVELS.txt
and
https://svn.apache.org/repos/infra/infrastructure/trunk/docs/vp/comms_plan.txt

* Resiliency

Discussions around resiliency have started, but are still nascent.

General Activity:
=================

In the month of May, Infra had 194 tickets opened and closed 158 tickets
in Jira.
For the month of May, Jake Farrell closed the largest number of tickets,
with 56. Highlights include:

* Dealt with the emerging DMARC issue and blogged about it at
  https://blogs.apache.org/infra/entry/dmarc_filtering_on_lists_that
* Rewrote our qmail/ezmlm runbook documentation to bring it up to date.
* Raised potential UCB issues with our current organizational usage of
  committers@. See INFRA-7594 for background.
* Dealt with a crop of OpenSSL-related security advisories.
* Dealt with two as-yet unpublished security vulnerabilities.
* Published a blog entry on the mail outage post-mortem:
  https://blogs.apache.org/infra/entry/mail_outage_post_mortem
* Confluence was patched after advance warning from Atlassian, before
  they went public with a security vulnerability.
* During the course of compiling an inventory for Virtual and adding in
  our cost to purchase those units, we discovered that 5 machines were
  not in our inventory. Three of those machines were either unutilized or
  underutilized. This will likely reduce some of our expected hardware
  spend, as they were relatively recent purchases.
  http://apache.org/dev/machines.html
* Enabled emails sent from apache.org committer addresses (or any
  addresses in LDAP) to bypass moderation across all apache.org mailing
  lists. No changes to SPF records for the foreseeable future.
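The moderation change noted above - mail from any address known to LDAP skips list moderation - amounts to a simple membership test at delivery time. The sketch below is hedged: the address set stands in for a real LDAP query, and the function name is hypothetical, not part of qmail/ezmlm.

```python
# Illustrative sketch of the moderation-bypass rule: a sender address
# found in LDAP skips moderation; all other senders are held as before.
def needs_moderation(sender, ldap_addresses):
    """False if the sender is a known (LDAP-listed) address."""
    return sender.lower() not in ldap_addresses

# Stand-in for addresses pulled from LDAP (hypothetical examples).
ldap_addresses = {"jdoe@apache.org", "asmith@example.org"}
print(needs_moderation("JDoe@apache.org", ldap_addresses))    # False: bypasses
print(needs_moderation("random@example.com", ldap_addresses))  # True: moderated
```

Case-folding the sender address, as above, matters in practice since envelope senders vary in capitalization.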
New Karma:
==========

Andrew Bayer (abayer) was granted Jenkins admin karma.

Finances:
=========

Infra has spent, or been authorized to spend, almost $3300 thus far in
the new fiscal year; all related to replacement hardware or service for
hardware.

Operations Action Items:
========================

None

Short Term Priorities:
======================

* OSU hardware failures

We have a number of hosts with degraded or dead hardware in our Oregon
colo. This is a mixture of machines in and out of warranty, and involves
machines that host both core services and less important ones. Status is
being tracked at: https://pad.apache.org/p/osustatus

* Outage recovery

Coming out of our outages, we have a substantial number of remediation
items. In some cases the service has been restored but is not back to
pre-outage levels of operation.

* builds.a.o

Stabilization of Jenkins is a primary concern. Much work has happened
from volunteers and contractors alike (see comments below in general
activity as to improvement). We are still suffering from service failures
every couple of days at this point.

Long Range Priorities:
======================

* Monitoring

Our monitoring still lacks the level of insight needed to provide
operationally significant information. Work continues on this front. Our
new monitoring system (Circonus) should come online in the next few
weeks, but much remains to be instrumented for it to be truly useful.

* Automation

Slow progress continues on rolling out configuration management, in an
effort to make our infrastructure better documented and more easily
reconstructed.

* Technical Debt

Work is ongoing to prioritize the services infrastructure provides, and
to set expectations and service levels around those services.

* Resiliency

I wish that I could say that much work has occurred here, but most of the
month has been focused on outage recovery. The beginnings of that work
have taken place in working to restore a stable platform.
(See the note around hardware at OSU.)

General Activity:
=================

Infrastructure suffered three major outages in this reporting period.

The first involved the Buildbot host and a disk failure. The Buildbot
host aegis lost a disk, and the machine was rebuilt over a few days,
changing from Ubuntu to FreeBSD 10; the CMS and Buildbot project builds
were down while this happened. At the same time, the Buildbot master was
upgraded to the latest release, which required some tweaks to the code
and project config files.

The second outage was the blogs.a.o service. You can see the details and
remediation steps that are being taken here:
https://s.apache.org/blogspostmortem

The third was a 4-day outage of our mail services. You can see the
results of the post-mortem here:
http://s.apache.org/mailoutagepostmortem
As of this writing there is still a significant backlog of email being
processed. At the current rate, we expect the backlog to be cleared by
May 16th.

Infra has noted an increased level of concern regarding the CI systems,
and in particular the Jenkins side of builds. Some projects are concerned
about the level of support that Infra gives these systems. A combination
of factors over the last months has seen a decline in support: other
higher-priority services taking up time, a decline in volunteer time, and
an increase in projects using the systems in parallel with an increase in
build complexity, all while slave capacity has not scaled to match. All
this is being resolved as we speak, improvements are being made, and
there are many plans for the short, medium and long term. The work done
already is showing progress. For example, on 2 May the average load time
for builds.a.o was 72.69 seconds and the average number of builds in the
queue was 65.
On 13 May the average load time was down to 1.86 seconds and the average
number of jobs in the queue was less than 4. Much work remains to be
done. For data, see: http://people.apache.org/~hboutemy/builds/

Plans to address existing issues:
=================================

As has been noted, Infra ran into a number of problems bringing several
key services back into use. There are a number of planned steps that are
either remedial, or work around building a more robust foundation. All of
these tie back into the long-term priorities you see above. Below are
things I have requested of one or more contractors:

* SLAs - We're dividing services up by various criteria. Failures happen,
but our level and rapidity of response, as well as the degree to which we
engineer for failure, must be measured against how critical the service
is. The current plan is to submit the finished work to the President for
review and discussion with an audience he deems appropriate.

* Prioritization of hardware replacement - New hardware doesn't guarantee
against an outage. However, continuing to test the mean time between
failure of underlying hardware tends to increase risk on average. Along
with this prioritization, I've asked that each of the services being
replaced be handled by a person who isn't the 'primary' for that service.
That list is not yet complete, but is being worked on.

* Documentation - Currently the quality varies from service to service.
Some of our documentation is clearly out of date; some is decent. My
experience is that most documentation suffers from bitrot in any
organization. However, I've requested multiple folks to bring our docs
for various services up to a usable state. I've also requested that folks
other than those who produced (or will be producing) the documentation
review it and use it, to ensure it is accurate and adequate.

* Backups - In general, our backups, where they've been happening, have
been sufficient.
We've already had work around documenting restoration from backup
committed to our docs in SVN. Additionally, short-term tasks have been
handed out about establishing, verifying, or restoring backups, as well
as checking those against the services and documentation. There are also
tasks in place to work on speeding up our restore timelines.

* Automation - We possess a lot of operational automation (scripts and
other tools that allow us to create or subscribe to lists, create users,
etc.). We have bits and pieces of infrastructure automation, but it's not
widespread. In the three outages we experienced catastrophic failure of
the hardware, resulting in the need to rebuild the service from scratch.
Virtually all of the moving pieces involved manual processes, from OS
installation to service configuration. That dramatically increased our
time to recovery, as well as being prone to user error. To that end, I've
requested the following:

- Consolidate the number of platforms we support for core services. We
  currently have Solaris, Ubuntu, and two major versions of FreeBSD. I've
  asked for a single version of Ubuntu and a single version of FreeBSD to
  be adopted across all of our non-build and non-PMC infrastructure.

- Deploy an automated OS installation tool. During the mail outage we had
  to get smarthands in the datacenter to burn an OS install DVD and
  deploy a fresh operating system, twice. This meant that a ten-minute
  task turned into more than an hour in each case. I've set the criteria
  that we be able to deploy our installs over the network and control
  booting and other functions via an out-of-band management tool such as
  IPMI. We must also be able to host our own package repositories.

- Configuration management. We currently have Puppet deployed, but it
  isn't widely used within our infrastructure.
Puppet permits you to declare state in its domain-specific language,
which controls how a machine is configured and what software is
installed, as well as collecting data on the machine itself. Puppet also
enforces state, and this enforcement is, quite frankly, better than
documentation. Even if a machine is completely destroyed, by having done
the work in Puppet we know the exact state of the machine and can deploy
that exact configuration to a new machine in a matter of moments. To that
end I have planned the following items:

- Training. Most of the infra contractors have not used Puppet in anger.
  Beginning in the next few weeks, they'll make use of some gratis online
  training from Puppet Labs, with plans for attending a hands-on class
  within 6 weeks (budget-willing).

- Mandatory use for new services. I've asked that all new work and
  services being stood up must be done using Puppet.

- Service restoration. For core services that have failed recently, we've
  either updated documentation or have tasks to do so. I've requested
  tasks for translating that into Puppet manifests, to dramatically
  reduce our mean time to recovery. For services that will move to new
  hardware, if that involves the recreation of the service, I've asked
  that it be done via config management as well.

- Base OS deployment. The base OS deployment at the ASF is very well
  documented. In the case of FreeBSD it's ~26 individual manual steps
  that must be executed every time. In conjunction with work on an
  automated OS install, I've asked that all of the base OS deployment and
  configuration be automated via Puppet.

* Monitoring - Put simply, our monitoring does not currently provide
enough insight. For example, we did not know about the failing hardware
underlying our mail service; according to our monitoring, things were
fine. This isn't to say that knowing about it would have prevented the
outage, but I would at least like the advantage of knowing about it in a
timely fashion.
As mentioned elsewhere, when smarthands were working on our equipment
they noted that many of our servers were complaining about hardware
problems. Monitoring is largely grunt work: knowing what to monitor for
each service is something the contractors can rattle off, but actually
setting up monitoring is a large time sink. We currently have a volunteer
doing a good chunk of work, and my plan is to temporarily (on a 3-6 month
timeframe) supplement that with an outside contractor who is already
familiar with our monitoring system and with Puppet.

None of the above prevents failure. It might give us an edge in detecting
that a failure is about to occur, or permit us to drastically reduce our
time to recovery, but it does not actually keep bad things from
happening. The longer-term piece of this puzzle is to begin engineering
our most important services to be more redundant or more fault-tolerant.
Most of our services are not set up this way. Our first target is going
to be the mail service; we are doing this for two reasons. First, our
experience with the mail backlog and the hoops we had to jump through to
empty that backlog suggests that we aren't very far from the limits of
our current architecture. Second, as you've seen in the past few weeks, a
mail outage is absolutely crippling for the Foundation.

That said, please understand that the problems we have are not going to
be solved in the short term. By the end of the quarter I hope to be able
to report that we have a good start on these initiatives, but this is a
long-term effort. Unless luck intervenes, it's almost inevitable that
we'll suffer another outage this year. Hopefully we'll be in a better
place to respond to those outages as we go forward.
New Karma:
==========

Finances:
=========
No purchases/renewals for the month since last report.

Operations Action Items:
========================

Short Term Priorities:
======================
* Look into mac build slaves.
* Converge on git.apache.org migration to eris. (Step 1 is merge git -> git-wip on tyr) (opinions?)
* Investigate / negotiate external code-signing capability, currently in talks under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.
* Complete nagios-to-circonus migration for monitoring.
* Continue to experiment with weekly team meetings via google hangout.
* Explore the possibility of revamping the infra documents to have a more intuitive feel about them, improve readability.
* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less painful this time around. (Support case closed; nothing useful came from it other than "check the logs".)
* Port tlp creation scripts over to new json-based design on whimsy.

Long Range Priorities:
======================
* Choose a suitable technology for continued buildout of our virtual hosting infra. Right now we are on VMWare but it is no longer gratis software for the ASF.
* Continue gradually replacing gear we no longer have any hardware warranty support for.
* Formulate an effective process and surrounding policy documentation for fulfilling the DMCA safe harbor provisions as they relate to Apache services.
* Institute egress filtering on all mission-critical service hosts.

General Activity:
=================
* New 3-year wildcard SSL cert purchased and installed for *.openoffice
* Thrift migrated to CMS with the aim of providing better support for similar sites. Blog entry here: http://blogs.apache.org/infra/entry/scaling_down_the_cms_to
* The number of confluence administrators was significantly reduced, to keep the list as small as possible.
Historically this permission level was required to operate and manage the autoexport plugin, which has since been deprecated. See https://issues.apache.org/jira/browse/INFRA-7487
* An inter-project communication site was requested by the community at ApacheCon and is being looked into by infra. This will essentially be an aggregator of project development wishes/requests, and will most likely reside on wishlist.a.o.
* As a way of lowering the bar for (and securing) security reports, infra is looking into creating a system which, based on LDAP, accepts and encrypts security reports for projects. The exact setup and nature of this system is being discussed, primarily with members of the subversion PMC.
* Heartbleed happened: see https://blogs.apache.org/infra/entry/heartbleed_fallout_for_apache
* Two members of the infrastructure team attended Apachecon NA 2014 and had a few community sessions with committers to hear their concerns and attempt to address them. Also met with Cloudstack members to discuss their widely publicized proposal for additional infrastructure needs surrounding project builds.
New Karma:
==========
mdrob added to infra-interest.

Finances:
=========

Operations Action Items:
========================

Short Term Priorities:
======================
* Look into mac build slaves.
* Converge on git.apache.org migration to eris. (Step 1 is merge git -> git-wip on tyr) (opinions?)
* Investigate / negotiate external code-signing capability, currently in talks under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.
* Complete nagios-to-circonus migration for monitoring.
* Continue to experiment with weekly team meetings via google hangout.
* Explore the possibility of revamping the infra documents to have a more intuitive feel about them, improve readability.
* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less painful this time around. (Support case closed; nothing useful came from it other than "check the logs".)
* Port tlp creation scripts over to new json-based design on whimsy.

Long Range Priorities:
======================
* Choose a suitable technology for continued buildout of our virtual hosting infra. Right now we are on VMWare but it is no longer gratis software for the ASF.
* Continue gradually replacing gear we no longer have any hardware warranty support for.
* Formulate an effective process and surrounding policy documentation for fulfilling the DMCA safe harbor provisions as they relate to Apache services.
* Institute egress filtering on all mission-critical service hosts.

General Activity:
=================
* The new GitHub features have been well received, with 28 projects already onboard in February alone. As a result, the number of GitHub-related messages on the public ASF mailing lists has risen from 304 in January to 3,616 in February, with expectations to exceed 5,000 in March. There has been a discussion on whether to transition from opt-in to opt-out on these features, but for the time being it remains opt-in.
* Instituted a weekly cron to inform private@cordova about the current list of committers not on the PMC, which should be the empty set. Currently about a third of the PMC is impacted, with no indication that this will ever be addressed by the chair; the requisite notices have already been sent to board@.
* Discussed the current state of affairs with our build farms as they relate to TrafficServer's needs. We intend to address this with increased funding in next year's budget.
* Received a report about several compromised webpages hosted by VMs associated with OFBiz. In the process of working with the PMC to correct this situation.
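The weekly check described in the first bullet reduces to a set difference between two rosters. A sketch with hypothetical rosters (in production the two groups would be pulled from LDAP; the group names and query are infra-internal):

```python
def committers_not_on_pmc(committers, pmc):
    """Return the sorted list of committers who are not PMC members.

    For a project where every committer is expected to be on the PMC,
    this should be the empty list.
    """
    return sorted(set(committers) - set(pmc))

# Hypothetical rosters for illustration:
committers = {"alice", "bob", "carol"}
pmc = {"alice", "carol"}
print(committers_not_on_pmc(committers, pmc))  # ['bob']
```

A cron job would run this and mail the result to the project's private list whenever it is non-empty.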
New Karma:
==========

Finances:
=========

Board Action Items:
===================

Short Term Priorities:
======================
* Look into mac build slaves.
* Converge on git.apache.org migration to eris. (Step 1 is merge git -> git-wip on tyr) (opinions?)
* Investigate / negotiate external code-signing capability, currently in talks under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.
* Complete nagios-to-circonus migration for monitoring.
* Continue to experiment with weekly team meetings via google hangout.
* Explore the possibility of revamping the infra documents to have a more intuitive feel about them, improve readability.
* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less painful this time around. (Support case closed; nothing useful came from it other than "check the logs".)
* Port tlp creation scripts over to new json-based design on whimsy.
* Ensure all contractors are participating in on-call situations, minimally by requiring cell-phone notification (via SMS, twitter, etc.) for all circonus alarms.
* Explore better integration with GitHub that allows us to retain the same information on the mailing list, so that vital discussions are recorded as having taken place in the right places (if it didn't happen on the ML...).

Long Range Priorities:
======================
* Choose a suitable technology for continued buildout of our virtual hosting infra. Right now we are on VMWare but it is no longer gratis software for the ASF.
* Continue gradually replacing gear we no longer have any hardware warranty support for.
* Formulate an effective process and surrounding policy documentation for fulfilling the DMCA safe harbor provisions as they relate to Apache services.
* Institute egress filtering on all mission-critical service hosts.

General Activity:
=================
* Migrated dist.apache.org from backups of thor to eris. Unfortunately a dozen commits were lost in the process.
Thanks to TRACI.NET for providing additional bandwidth for this purpose.
* Jira: Jira is now running on Apache Tomcat 8.0.0 (rather than 7.0.x). While running on 8.0.x is unsupported by Atlassian, this is providing valuable feedback to the Tomcat community. To mitigate the risk of running an unsupported configuration, Jira is being monitored more closely than usual for any problems, and there is a plan in place to roll back to 7.0.x if necessary.
* At the behest of committers, we have started working on a stronger implementation of GitHub services, including 'vanity plates' for all Apache committers on GitHub. A method of interacting with GitHub Pull Requests and comments has been completed that both interacts with the GitHub interface and retains all messages on the local mailing lists and JIRA instances for record keeping. At the time of writing, we have 367 committers on the Apache team on GitHub. We have made a blog entry about this at http://s.apache.org/asfgithub which seems to have reached many projects already. Furthermore, the Incubator has been involved in the development of this, and is thus also aware of its existence and use cases.
* The new SSL wildcard was obtained from Thawte earlier this month, and will be rolled out to services very soon. Thanks to jimjag for getting the business end of the deal done so we could actually get the cert in before the incumbent expires.
* All remaining SVN repos have now been upgraded to 1.8.
* Resurrected thor (mail-search) after soliciting help from SMS for on-site repairs.
* Amended release policy to provide rationale and spent time explaining the new section to members@. See http://www.apache.org/dev/release#why
* Worked with Cordova on processing their historical releases to comport with policy.
New Karma:
==========

Finances:
=========

Board Action Items:
===================

Short Term Priorities:
======================
* Look into mac build slaves.
* Converge on git.apache.org migration to eris. (Step 1 is merge git -> git-wip on tyr) (opinions?)
* Investigate / negotiate external code-signing capability, currently in talks under NDA. INFRA-3991 is tracking the status, and a Webex call has taken place.
* Look into rsync backup failures to abi. Look into clearing out a lot of room on abi - currently 20GB left, and 20GB+ a day gets backed up.
* Complete nagios-to-circonus migration for monitoring.
* Continue to experiment with weekly team meetings via google hangout.
* Explore the possibility of revamping the infra documents to have a more intuitive feel about them, improve readability.
* Confluence Upgrade. Upgrade from 5.0.3 to latest. Hopefully will be less painful this time around. (Support case closed; nothing useful came from it other than "check the logs".)

Long Range Priorities:
======================
* Choose a suitable technology for continued buildout of our virtual hosting infra. Right now we are on VMWare but it is no longer gratis software for the ASF.
* Continue gradually replacing gear we no longer have any hardware warranty support for.
* Formulate an effective process and surrounding policy documentation for fulfilling the DMCA safe harbor provisions as they relate to Apache services.

General Activity:
=================
* Confluence: Finally got it upgraded to 5.0.3. Database edits and conversions were needed to make the transition. After a few days of bedding in, it seems to be performing much better than the previous version.
* Translate.a.o: Upgraded to 2.5.1-RC1 (that is a release). There were severe compatibility issues. Reprogrammed part of the LDAP connection to make it more stable (and work).
* [2nd Jan 2014] - Jenkins Master was migrated to a much-needed new server. This also eases the pressure on Buildbot Master since the split of hosts.
* Migrated SVN repositories to a newer, larger, and hopefully quicker array on Dec 31st. The repository upgrades will now be done in the coming weeks, once we have seen stability in the Infra repository for at least 1 week. We will then likely re-purpose the SSDs in the old array and add them to the new array for improved caching. Total downtime for the move was 1h15m, as the prep work had been underway for at least 2 weeks beforehand.
* RE: Symantec code signing service - there are a handful of internal tasks to complete before we can move on.
* Migration and reinstallation of continuum-ci.a.o (was vmbuild.a.o) has taken place. [Final checks are in progress before announcing its GA]
* [5th Jan 2014] - blogs.apache.org was upgraded by the roller project.
* Faulty gmirror disk on eris; liaised with OSUOSL and swapped out the disk.
New Karma:
==========

Finances:
=========
* Funded Daniel Gruno's attendance at EU Cloudstack conference: cost TBD.

Board Action Items:
===================

Short Term Priorities:
======================
* Clear the lengthy backlog of outstanding tlp-related requests.
* Repurpose the new hermes gear for use as a (jenkins?) build master, as that is more pressing.
* Investigate the migration tooling available for conversion from VMWare to Cloudstack [See attachment INFRA-1].
* Look into mac build slaves.
* Migrate eris svn repos to /x2, converting everything to 1.8.
* Converge on git.apache.org migration to eris.
* Investigate / negotiate external code-signing capability, currently in talks under NDA. INFRA-3991 is tracking the status, and a Webex call is being arranged.
* Look into rsync backup failures to abi.
* Complete nagios-to-circonus migration for monitoring.
* Continue to experiment with weekly team meetings via google hangout.
* Continue with the effort to reduce the overwhelming JIRA backlog. At the start of the reporting period we had 134 open issues. We are now down to ~90 open issues.
* Jan Iversen has been pushing through the outstanding tlp requests for virtual machines. Several projects should by now have their VM.
* Explore the possibility of revamping the infra documents to have a more intuitive feel about them, improve readability.
* Confluence Upgrade. Needs an intermediate upgrade to 5.0.3, then to latest. Attempts to upgrade to 5.0.3 have been made and failed; we opened a support case but are continuing to try on a test instance.

Long Range Priorities:
======================
* Choose a suitable technology for continued buildout of our virtual hosting infra. Right now we are on VMWare but it is no longer gratis software for the ASF.
* Continue gradually replacing gear we no longer have any hardware warranty support for.
* Formulate an effective process and surrounding policy documentation for fulfilling the DMCA safe harbor provisions as they relate to Apache services.

General Activity:
=================
* Both new TLPs this month, Ambari and Marmotta, were processed within 24 hours of board approval.

Attachment INFRA-1: Cloudstack Conference Feedback [Daniel Gruno / humbedooh]:
==============================================================================
Attended CCC (Cloudstack Collaboration Conference) at Beurs van Berlage in Amsterdam. Tried out Cloudstack locally with a /27 netblock, as well as on testing platforms available at the conference. Apart from minute errors in the UI (which I have reported), it seems to be working as expected.

Cloudstack supports LDAP integration; however, this integration is not feature-complete, and it is my view that an infra-made LDAP implementation - with regards to _non-infra involvement_ - is preferred, though we may elect to use it for the administration of the hosts.

Attended a talk about Apache LibCloud, which seamlessly integrates with Cloudstack for easy programmable management of VMs via Python. This removes the need to deal with the rather cumbersome Cloudstack API, and enables the possibility of creating an infra-managed site for dealing with VMs in several ways. Should we ultimately decide on another cloud solution, LibCloud integrates with just about every platform out there, and so would not be affected to any large degree. I did not get a chance to properly test LibCloud, so my findings in this regard will have to be substantiated at a later date.

Cloudstack supports VMWare (vSphere), Xen(cloud), and KVM hypervisors, so migrating is a question of "when" rather than "if". It supports using different hypervisors on different pods (a collection of hosts), so working in tandem with KVM or a similar free hypervisor is an option.
Migration options (assuming we go with KVM or similar):

A) Dual hypervisor mode (use both WS and KVM, only allot new VMs on KVM?)
B) Migrate WS boxes to KVM (Qemu-KVM supports this natively with VMWare version 6/7 disks)

If A, then we need to use separate pods for WS and KVM. If B, then we pull boxes offline one by one, move the images to the new host, and KVM can handle the images.

Tentative proposal for future VM management: Create one or more hosts with KVM in CS (or OS), assign a pod to the old WS clients, and use Apache LibCloud within an LDAP-authed site (TBD) where PMC members can request, restart, get access to, and resize (to be acked by infra) instances. Liaise with Tomaz Muraus (LibCloud), Chip Childers (Cloudstack), etc. on the actual implementation details. This would mean that infra's only role would be to ack the creation/resizing of VMs and provide general oversight, rather than manually creating/modifying each VM. I expect to have a mockup of what such a site could look like ready for infra to review and discuss by mid-December, thus adding something of value to the next board report.

Jake Farrell has offered to help with the CS setup, as he has experience running this in large environments. There have been some discussions of maybe using other management platforms instead of CloudStack, but given that CloudStack and LibCloud are Apache projects, it is my opinion that we are better served, support-wise, by using software developed by the foundation, as well as by the proverbial "eating our own dog food".
Discussed funding a pair of contractors to attend a Cloudstack conference to gain additional skills - approved by VP Infra. Only one will actually attend. Acquired a free license for Jira Help Desk - rollout forthcoming. Installed wildcard SSL cert for *.openoffice.org. In pursuit of outsourced code-signing capability for project releases; negotiations have reached the NDA phase. Migrated the bulk of our SQL infra to a centralized database server. Discussed replenishing our Mac build infra. Purchased a wildcard cert for *.incubator.apache.org: 3 years at $475 per year. Began holding informal weekly meetings via google hangouts, open to all infra-team members. Had a configuration regression regarding the PIG and DRILL Confluence wikis, which allowed additional spam to reappear on those spaces. Apachecon.eu DNS reacquired from our registrar; somehow it wasn't configured to autorenew, so we lost that domain for a few days. We are still considerably behind the curve in our Jira workload, and that is starting to inform some of the reporting at the board level. Please be patient while we continue to ramp up with existing personnel to support the org's continued growth. In response we have organized a monthly jira walkthrough day dedicated entirely to outstanding jira requests. Raw jira stats show we have made significant progress over the past month, and we expect that trend to continue, with 116 opened vs. 166 closed. Aegis is reporting a bad disk that needs to be replaced, as the host is seriously underperforming in its current state. Dell has solicited a warranty renewal offering for arcas, our jira server. We need to sort out licensing for our VMWare infra, as we are currently in a holding pattern for new VMs until this gets resolved. We've disabled users' ability to edit their profile page in confluence, eliminating another common source of spam.
An onslaught of Confluence spam required us to change the default permission scheme to match what we've done for the moin wiki. We've also formally withdrawn all support for the autoexport plugin. Closed out the account of a deceased committer. Received delivery at OSUOSL of the gear we ordered last month. Now in the process of bringing it online. Upped the default per-file upload limit for dist/ to 200MB (from 100MB).
Discussed recent scaling issues with roller and made appropriate adjustments to the install. Discussed picking up an SSL cert for openoffice.org. Dealt with disk issues in the Y! build farm. Dealt with high CPU consumption on the moin wiki. Dealt with a vulnerability on the analysis VM. Dealt with disk performance issues on erebus (VMware). Discussed creating a dedicated database server again. Dealt with a wide brute-force password guessing attempt against our LDAP database; about 800 users were impacted, none of whom apparently had their passwords guessed. Replaced a bad disk in hermes (mail). Set up an organizational account with Apple to allow devs to put their wares in the App Store; Cordova will be the first guinea pig. Ordered some new Dell gear slated for replacement of existing hosts, trying a new supplier largely for cost savings. Trying, so far unsuccessfully, to get our free VSphere license updated. A contributor inadvertently included customer data in two bugzilla attachments and politely requested that it be removed. This request was initially denied based on a careful reading of the current policy. Subsequently, the author of that policy described his intent, the contributor provided more information as to what they were requesting to be removed and why, and as a result the request was implemented. The infrastructure team plans to revisit whether or not the policy needs to be updated. As to the potential policy change, there is a saying in legal circles that "hard cases make bad law" -- as well as a saying that "bad law makes hard cases". To some extent, both apply here. The overwhelming majority of requests for deletion are from people who want something removed from a mailing list that is widely archived and mirrored. Often these requests come in after a considerable period of time has elapsed.
For these reasons, it probably is best that the documented policy continues to set the expectation that most requests will be denied -- and further, I believe that we should be open to granting exceptions whenever possible. Reflecting on (a) the low frequency with which exceptions will be granted, and (b) the amount of effort it took to resolve this, perhaps the simplest thing that could possibly work would be the addition of a statement like the following: "Exceptions are only granted by the VP of Infrastructure; requests for removal of items that have already been widely mirrored outside of the ASF's control are unlikely to receive serious consideration."

http://en.wikipedia.org/wiki/Hard_cases_make_bad_law
http://en.wikipedia.org/wiki/Hard_cases_make_bad_law#Bad_law_makes_hard_cases
A report was expected, but not received
Discussed the logistics of bringing the ACEU 2012 videos online. Dealt with the fallout surrounding the recent javadoc vulnerability. Discussed upgrading our VSphere license with VMWare. We continue to iron out kinks with our new circonus monitoring service. Added a brief mission statement here: http://www.apache.org/dev/infra-contact#mission Upgraded svn on eris (svn.us) and harmonia (svn.eu) to current versions. Discussed making shell access from people.apache.org an opt-in service.
Purchased 9 disks from Silicon Mechanics to fill out the eris array - cost ~$1400. Promoted Jan Iversen to the Infrastructure Team. Had oceanus (new machine) racked in FUB. Mark Thomas kicked off a new round of FreeBSD upgrades. Disabled CGI support for user home directories on people.apache.org. Purchased a wildcard cert for *.openoffice.org at $595/year from digicert. Daniel Gruno was given root@ karma and will need to be added to the committee as a result. Set up a new VM with bytemark for circonus-based monitoring. Work continues around the Flex Jira import problems. Acquired several new domains for management by us instead of external parties.
Completed our budget deliberations, including funding for a new part-time position. Purchased 3 new HP switches to replace our aging Dell switches; cost ~$4700. Continued discussion of code-signing certificates for our projects. Dealt with some failing/overloaded build machines in our Y! farm. Jan Iversen continued to work on our nagios -> circonus service monitoring migration. 2 disks have failed in loki (tinderbox); we've replaced one from inventory but will need to order more to complete the replacement. Experienced some security / porn issues with the moin wiki and have upgraded to the latest version to assist with controlling the spam. We will be disabling password-based ssh access to people.apache.org in the near future, once the supporting scripts have been tested. Rainer Jung was granted root karma and needs to be added to the formal committee roster.
A report was expected, but not received
About to spend ~$1500 for additional drive capacity for eris (svn.us). Updated our inventory with Traci to better align with their power-cycling service. Set up the wilderness.a.o lua playground for Daniel Gruno. Granted danielsh and gmcdonald karma to grant IRC cloaks. Upgraded bugzilla to 4.2.5 in response to a vulnerability announcement. Jira was upgraded to the latest stable (5.2.8). Daniel Gruno set up a direct SMS service for root users to take advantage of. Again discussed timing issues surrounding the dissemination of authoritative declarations about newly minted TLPs. Set up paste.apache.org - a pasting service for committers to use. Still dealing with the fallout of our failed rack-1 switch; now pursuing indirect support with Dell. Received/deployed the new box for an additional vmware hosting service. Upgraded httpd on eos and aurora (www/mod_mbox service) to 2.4.4. Restored archiving for the EA mailbox. Jan Iversen upgraded mediawiki for Apache OpenOffice due to an announced vulnerability issue. Working with concom to set up a USB disk with video data on it in one of our OSUOSL racks. Pruned the apmail group list down to the relevant, currently active folks.
Placed a $15K order for a new vmware server with Dell, which is now on backorder through February. Still dealing with the fallout of losing one of our public switch interfaces. Uli is working on OOB access for us at FUB. Specced some additional drive capacity for eris. Discussed setting up apaste.apache.org as a pasting service for Apache, based on Daniel Gruno's apaste.info site. Started the process of reining in the abusive maven traffic to svn.apache.org. Started the process of dealing with the missing Flex attachments for a Jira import. Shut off the people -> www rsync jobs for our websites; all project sites now MUST be on either svnpubsub or the CMS to continue to be maintained. Enabled redirects on our svn.apache.org services for graduated podling trees. Upgraded the software on adam (OSX) for $40, thanks to Sander Temme. Was contacted by Traci to update our inventory with them.
Rainer Jung and Jan Iversen have done a stellar job of coordinating and collaborating on a new wiki host for openoffice, among other activities from Rainer in particular. Discussed with the secretary the best way to populate a new TLP's LDAP groups from either the unapproved minutes or the agenda file. After a bit of back and forth we settled on the agenda file for the time being; however, this effort remains a convenience for the incoming chair, not a means of setting up the groups in a permanent and official way. The chair remains responsible for vetting the groups post-setup. Had a robust yet not entirely satisfying discussion with the membership about relaxing ACLs for our subversion service, culminating in the following url: http://www.apache.org/dev/open-access-svn It is expected to take an extended period of time before such changes can be effectively implemented as infra policy, but the goal of simply raising awareness has been met. We are in the final phases of withdrawing support for rsync-backed websites, which we expect to complete before the end of the month rolls around. At this time there are still several outstanding projects who have yet to file a jira ticket with us to migrate to either svnpubsub or the CMS, and their ability to continue to service their live site with updates and new content will be impacted. Daniel Gruno and others have been working on a gitpubsub service over on github and have rolled out a demo version to our live writable git repos. We expect even more coordination between the svnpubsub service maintained by the subversion crew and the gitpubsub service Daniel and others have worked on.
Lost our public VLAN in our rack1 switch for undetermined reasons, probably due to a misconfiguration on the OSUOSL side. Will continue followup with OSUOSL for eventual resolution. Enabled core dumps on one of our mail-archives servers to better diagnose the nature of the ongoing segfaults. Specced a new VMWare host to offer additional VMs to our projects, then haggled with each other over the config. It now appears we're going to repurpose chaos (36-disk enclosure) to serve up a Fibre Channel interface to the new host. The tlpreq scripting is now in place and ready for new graduating projects to use; we will pass along the details to the Incubator for podlings due to graduate in December. Decided we're comfortable with the OpenOffice project keeping at most 2 releases on the Apache mirror system (/dist/) at any one time. Henk Penning has communicated this back to the AOO PMC. Came across a bizarre privacy information leak with Jenkins and LDAP; we've patched our installation to mitigate the issue. Discussed our near-term plans for git hosting on various lists. One of the exchanges was needlessly heated, and we have tried to rectify the situation with better documentation and a bit less of the BOFH tactics.
Restructured the creation of new mailing list infrastructure: new "foo" lists will be named following the convention foo@$podlingname.incubator.apache.org, instead of the now antiquated $podlingname-foo@incubator.apache.org. Similarly restructured website assets for new podlings to use http://$podlingname.incubator.apache.org/ instead of the prior http://incubator.apache.org/$podlingname/. These changes will help make migration to TLP status easier for both the podling and the Infrastructure Team.

Coordinated with Sally with respect to the Calxeda/Dell ARM donation. Worked with OSUOSL to mitigate the downtime of a series of scheduled network outages affecting our .us services. Discussed the pros and cons of using Round-Robin DNS for websites; no action was taken moving us away from RR DNS. Rainer Jung upgraded our webserver install on eos (www.us) to the latest and greatest version of 2.4.x. Mark Thomas upgraded all 3 bugzilla instances in response to a security vulnerability report.

Some generic stats detailing the org's recent growth:

New committer intake: ~300/year, like clockwork, over the past decade.
New TLP graduations: 2010: ~20; 2011: ~10; 2012: ~30.
New INFRA issues: http://s.apache.org/INFRA-Creation-2005-2012
Mailing List / Subversion activity: http://www.apache.org/dev/stats/ [*]
Average Subversion traffic (hits): consistently ~3.2M/day for the past few years.
Inbound mail traffic: 250-300K connections per day for the past few years.
Average website traffic (hits): Nov 2010: 10M/day; Nov 2011: 11M/day; Nov 2012: 21M/day [**]
Rough download page traffic (hits): Nov 2010: 48K/day; Nov 2011: 46K/day; Nov 2012: 145K/day [***]
46 Virtual Machines (18 new within the past year) and 24 additional ARM servers due to the Calxeda/Dell donation.

[*] - clicking on the mailing list graph shows incubator + hadoop + lucene is now responsible for 40% of the org's total mailing list traffic.
[**] - 10M/day due to www.openoffice.org.
[***] - 100K/day due to openoffice (note openoffice users typically upgrade using the openoffice software itself rather than by visiting the download webpage).
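The mailing-list renaming described at the top of this report can be sketched as a simple address transformation. This is a hypothetical helper for illustration only (the real migration tooling is infra-internal), and it assumes the old-style podling lists took the form $podlingname-foo@incubator.apache.org:

```python
def modernize_list_address(old_address):
    """Map an old-style podling list address to the new convention, e.g.
    'foo-dev@incubator.apache.org' -> 'dev@foo.incubator.apache.org'.

    Addresses that don't match the old convention pass through unchanged.
    """
    local, domain = old_address.split("@", 1)
    if domain != "incubator.apache.org" or "-" not in local:
        return old_address  # not an old-style podling list
    podling, list_name = local.split("-", 1)
    return "%s@%s.incubator.apache.org" % (list_name, podling)

print(modernize_list_address("foo-dev@incubator.apache.org"))
# dev@foo.incubator.apache.org
```

Putting the podling name in the domain rather than the local part is what makes TLP graduation cheap: only the domain changes, and the list name itself is stable.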
AI: Sam follow up with regard to git post-commit hook support by infra
Expressed some concerns about the ongoing volunteer support and documentation for nexus (repository.apache.org). Spoke with a few PMCs about cleaning up temporary artifacts in their website rsync ops. Discussed infra meetup plans to coincide with Apachecon EU. Because the CIA.vc service was shut down, we are considering other options to provide the same functionality to our projects. Dealt with some issues surrounding the generation of projects.apache.org pages. The contract for colocation service with FUB has been signed by FUB and awaits our countersignature. Started discussing various approaches to simplify the podling graduation process from an infra standpoint.
Daniel Gruno added to the Infrastructure Team. Bought a pair of 1TB SATA drives, one of which was used to replace a bad disk in hermes (mail). Calxeda donated access to a 24-node ARM-based hosting service to be deployed as part of our build farm offerings. Removed the custom jira patch licensing plugin, largely because no one wanted to continue to maintain it. Migrated the jira service to its own hardware (a spare r410) for stability reasons. Picked up 7 more SATA drives for inventory and replacement. VP Infra pointed out various rumblings rising to the board level about contractor communications, and had various ideas about how to address that. Started work on a Circonus-based service to replace our aging nagios installation.
AI Jim follow up with infra regarding git status
No report was submitted.
Will report next month.
Updated the mailing list creation process- see https://infra.apache.org/officers/mlreq Determined that eve.apache.org, one of our Apple Xserves, is no longer capable of productive service due to various hardware faults. Discussed acquiring additional "cloud" services for our projects to use, ostensibly thru some unspecified bidding process. Nothing much came of it. Approached by Dell regarding warranty renewal for selene and phoebe (Geronimo TCK build farm). We declined. More discussion, much of it less than constructive, about providing a digital signature service to Apache project releases. Worked out a deal with Calxeda to provide a few ARM-based build servers for our projects to use (at no cost to us other than admin time). Granted Philip Martin of the Subversion project access to eris (US svn server), mainly for his offer to help with some svn server debugging. Working with our main DNS provider no-ip.com to stabilize our account services to better deal with the dozens of extra domains the AOO project needs us to host. We've been getting gratis service to this point, but we are willing to pay for better responsiveness and additional features only available with a paid support plan. Discussed plans for an infra meetup to roughly coincide with ACEU. Work on the backup system migration from bia (in Los Angeles) to abi (in Fort Lauderdale) is nearing completion. Some progress was made on getting the number of outstanding jira tickets down to normal levels. We've reclassified tickets based on whether they are "waiting for user" input or "waiting for infra", which has helped, but the bulk of the open tickets still are "waiting for infra". We expect to continue to make progress on this over the coming days and weeks, and will continue to report on it until we are satisfied things have returned to an acceptable state. Daniel Gruno put together a nice comments service at comments.apache.org for project websites to take advantage of.
blogs.apache.org has returned to normal service post-upgrade; thanks to Dave Johnson for the particulars. Migrated our jira instance to our shiny new phanes/chaos VM cluster and it seems to be performing rather well now. Thanks go to Dan Kulp for debugging the svn plugin for us, as well as culling the jira-administrators group to sane levels (projects will need to make better use of roles). Upcoming work to include flushing the backlog of pending jira imports, which we've now started with Flex. Discussed options for "Cloud support" for a certain GSOC project. Conclusion was that we don't currently have a suitable arrangement worked out with a cloud provider to offer the ASF the enterprise setup that we'd require. Did the password rotation dance again- this time there were no malicious activities surrounding the action. See http://s.apache.org/zZ for details. The situation has since returned to normal now that we've reenabled committer read access to the log archives on people.apache.org. Discussed java hosting futures with members of the FreeBSD community in light of the fact that Atlassian does not consider FreeBSD a supported platform, which at least partially motivated our move of jira from a FreeBSD jail to an Ubuntu based VM. imacat added to the Infrastructure Team to help support the ooo wiki and forums platforms. Work on new harmonia is currently underway- colocation provided by Freie Universität Berlin (FUB). Uli Stärk is our lead on this. Decided not to pursue a meetup at the Surge conference this year, preferring to get together at one of the upcoming Apachecon conferences. Discussed support status for git and documented our plans for bringing it to a fully supported service over the next 3-6 months, culminating in the following awkwardly tautologous statement by VP Infra: "The infrastructure team has four full time contractors and a variable number of volunteers and is committed to supporting both git and subversion." 
Experienced extended downtime for reviews.apache.org after an OS upgrade busted our install. Dan Dumont from Apache Shindig has been assisting us in recovery- we expect the service to return to active status by the time of the board meeting. We've fallen a bit behind in our caretaking of jira issues mainly due to a number of new graduations from the Incubator. We'd like to return the number of outstanding issues to "normal levels" within the next reporting period, which seems a reasonable goal given the expected (small) number of new graduations happening this month. We upgraded people.apache.org (aka minotaur) in light of the recent security reports from FreeBSD concerning a local root exploit. Considerable work has gone into scripting various workflows around common requests like mailing lists, git repos, and CMS sites. We've subsequently created infra.apache.org to house these efforts once they've fully gelled from their development versions at whimsy.apache.org. Daniel Shahaf is stepping back to part-time for three months starting in July.
Added Mohammad Nour El-Din to the Infrastructure Team. Upgraded Jira to version 5- kinks still being ironed out. Had OSUOSL install our recent purchases from Silicon Mechanics and Intervision. Coordinated with the OpenOffice PPMC and Sourceforge regarding the distribution of AOO's 3.4.0 release. Most of the traffic was handled by Sourceforge's CDN instead of our mirror system, at a rate of over 20 TB of download traffic a day. Download stats will be published by the AOO PPMC in the near future. Replaced isis's bad disks and brought up 2 additional build hosts at our Y! datacenter. Discussed another F2F meeting during the Surge 2012 conference. Not a lot of expressed interest so far. Experiencing uptime issues with our monitoring.apache.org Bytemark VM ever since they migrated the VM to different hardware. Discussing git hosting options again, and again, and again...
Coordinated the installation of aurora (www.eu) in SARA with Bart van der Schans. We are now out of free space in that datacenter. Our new backup server abi is in service at Traci. We've arranged for Sam to have access to the joes-local safe deposit box in an emergency. Worked out an informal deal with SourceForge to assist with the delivery of OpenOffice releases. Whether or not this continues to be used beyond the first release is up to the AOO PPMC. Henk Penning is putting the final touches on providing optional Apache-mirror support for OpenOffice releases. Bought an array from Silicon Mechanics for about $12K. The host to attach it to will be purchased through Intervision in the very near future. Updated the AOO bugzilla logins to reflect the fact that we host the openoffice.org domain but no longer allow mail for it. Considering alternatives to directly importing Flex's jira data into the main jira instance due to repeated failed attempts. Atlassian refuses to assist us until we can demonstrate the identical problem on a supported platform. Submitted our budget for review and approval. Harmonia's (svn.eu) disk subsystem finally stopped performing well enough to continue using it as our European svn slave. Specced a replacement host with Uli Stärk's help. Fleshed out a deal with Freie Universität Berlin for colocation services, targeting new harmonia as the first host to deploy there. Set a soft limit of a combined 1GB worth of release artifacts dropped onto the mirror system- anything exceeding that figure needs to coordinate with infrastructure in advance.
All outstanding bills have been paid. Spoke with the myfaces PMC about their public-facing maven repo on their zone. The zone has been taken down. Discussed a replacement purchase for harmonia (svn.eu) and an additional VMware server for Linux VMs. Upgraded the majority of our FreeBSD servers to 9.0. Will complete the remaining ones in the near future. Next time round we will likely enable dtrace throughout. Discussed infra's authority to pull improperly-signed releases from the mirrors. VP Infra concurs we can/should do this if appropriate justification is available. Mark Thomas upgraded our Bugzilla installs to 4.0.4. Dealt with our monitoring host being redeployed to a new VM server. Discussed hosting a yum repository for project releases. No decision was made at this time, pending further input from volunteers. Discussed deploying an Australian svn mirror to Gav's local server. No decision was made at this point. Greg Stein has taken svnpubsub over to Subversion's trunk and has done a lot of work on improving the svnwcsub service. We look forward to seeing svnpubsub distributed in a subversion release! Established the precedent of not resetting accounts for returning committers that are no longer a part of any active project. Experiencing chronic problems with isis, one of our build hosts in Y!'s datacenter. Y! has offered us a pair of new servers to supplant it. Began discussion of a budget for FY2012-2013. Posted a few CMS-related blog entries to http://blogs.apache.org/infra describing recent activity.
Still attempting to pursue a github FI instance for ASF use. Intervision LOC app has still not been filled out and sent off. Purchased an HP switch for SARA for 456 EU. Renewed warranties on Dell and Sun gear for 1Y with Intervision and Technologent, respectively. Prepped aurora and the HP switch for installation in SARA. Renewed apache.org DNS for another 9 years. Silicon Mechanics reminded us of our outstanding $2037 credit with them. We've sent out a notice to all PMCs about the plan to migrate all sites to svnpubsub or the CMS by the end of this year.
Still attempting to pursue a github FI instance for ASF use. Intervision LOC has still not been filled out and sent off. We've determined there is adequate space (9U free) for us to install our new gear in SARA. However we are in need of an additional switch, which is tasked to Uli Stärk for purchase. Awaiting a final decision from the OpenOffice podling regarding hosting of extensions and templates. They are considering either hosting those services locally at the ASF or with SourceForge. Joe spent a few hours training Melissa (EA) on svn use. A google calendar is now available for everyone's use thanks to Melissa. Secured and rolled out a wildcard cert from Thawte. Dealt with a minor security issue in the Nexus installation, reported by Sebastian Bazley. Partitioned our SSL termination for our linux hosts to separate VM's for additional security. Began deploying puppet for eventual management of our linux hosts. Dealt with a benign security incident on modules.apache.org. Further enhanced CMS performance with the introduction of parallel builds.
Floated the concept of providing Sam Ruby a place to supply a grab bag of useful CGI scripts he has developed over the years. Originally hosted on the id.apache.org server as rubys.apache.org but is in the process of being redeployed to a separate VM named whimsy.apache.org. Rejiggered switch 2 (our "private" switch) with OSUOSL help to better align it with our goals of a single switch per cabinet. Setup tethys.apache.org for dedicated service to the OpenOffice extensions and templates services. As these services include providing downloads for non-open-source licensed products, explicit permission by VP Infra for this purpose was granted. VP Infra delegated the final decision on licensing to the Incubator PMC once OpenOffice proposes to graduate. Setup translate.apache.org, which is a pootle-backed service for projects to use in their language translation efforts. Initiated "Git Friday" to focus on git-related support issues across the whole team on a weekly basis. Discussed installation of new aurora into SARA with Bart van der Schans and Wim Biemolt of SARA. Progress is gated on Wim getting back to us on the status of our to-be-determined additional rack space. Tony Stevenson put in his notice to return to paid contractor status starting Jan 1. Opened the floor to input on contract terms for sysadmins, in particular the immediate part-time post(s). General feedback on the idea of including explicit metrics in the terms was negative. Wrote a testimonial for the FreeBSD Foundation regarding our FreeBSD deployments. Gave the CMS a big performance boost by replacing the rsync usage with zfs clones. Put out an RFP for participation in our ongoing alpha test of git hosting. 7 projects responded; all 7 were accepted: wicket, callback, cassandra, s4, deltaspike, trafficserver, deltacloud. Participating projects are required to provide an infra volunteer to work on the git hosting code.
Got Thawte's nod for a wildcard cert via Bill Rowe's contacts (waiting on Sam Ruby to chase up Brian the sales rep).
Accepted a "loan" from IBM for a PPC64 server to be incorporated into our build farm. Welcomed the new VP Infra Sam Ruby to the leadership post for Infra. Changed our plans from simply replacing the bad disk in harmonia to replacing the entire host asap. Completed the migration of the openoffice.org domain to ASF control. Discussed mirroring options for the large set of artifacts produced in OpenOffice releases. Registered the apachecon.eu domain with our dotster account. Held a minor infra meetup at Apachecon led by Philip Gollucci to discuss transition issues for the VP role. Discussed a few proposed mail templates to send out regarding our svnpubsub migration plans for dist files and websites. Pushed to resolve a few outstanding DNS migration issues for the subversion, spamassassin, and libcloud domains.
Board Action Items:
===================
- Intervision LOC needs to be signed, filled out, and sent
- approved/dell-2011-05.pdf needs to be moved to paid by treasurer@
- approved/Noirin-Infra-flights.txt needs dealing with
- No other items are expected to be changed in Staff SA contracts. Need signature by an officer who is aware of the new terms.

General Activity:
=================
The harmonia (svn.eu) bad root disk situation remains unresolved. We expect to purchase a replacement disk with Uli's help and have Bart van der Schans install it shortly. Bart has received and tested the replacement server for aurora (www.eu). We need to schedule a date for decommissioning and eventual reinstall of aurora soon. Held an infra meetup to coincide with the Surge conference in Baltimore, MD. Discussed plans for the remainder of the fiscal year. Tabled the staff review because the president could not attend; we'll reschedule. See notes here -- http://s.apache.org/surge2011infra Pursuant to explicit authorization from the board, started the git hosting experiment with CouchDB as the initial guinea pig. Renewed the spamassassin.org domain for 1 year. Updated people.apache.org's operating system in light of the published local-root vulnerability in FreeBSD. Other FreeBSD hosts are scheduled to be upgraded to FreeBSD 9.0 on release. Began work rationalizing the state of our switches at OSUOSL. The current situation is a mess; we're moving to a one-switch-per-rack configuration with no cross-cabling between racks other than through the patch panel. Re-initiated the transfer of the openoffice.org domain to the ASF's dotster account. Renegotiated the financial terms of Gavin's and Joe's contracts. Began talks for another infra meetup sometime during Apachecon NA 2011.
New Karma:
==========

Finances:
==========
Payments to staff were about 1 week late this month.

Board Action Items:
===================
- Intervision LOC needs to be signed, filled out, and sent
- approved/dell-2011-05.pdf needs to be moved to paid by treasurer@

General Activity:
=================
Response time on new account requests remains under 2-3 days. The harmonia (svn.eu) bad disk situation remains unresolved at this time. Ordered a replacement host for aurora (www.eu) from a Dell reseller in Germany. Cost = 5259.80 EU. It is to be shipped to Hippo for eventual installation by Bart van der Schans. Daniel Shahaf improved our automated banning of abusive IP addresses with respect to svn traffic. Failed to successfully incorporate Terry Ellison of the openoffice.org (ooo) project into the infrastructure community. Terry was working on migrating the existing wiki and forum services for that project to ASF gear, but gave up after becoming frustrated with his interactions with the ooo community at the ASF and the infrastructure team in particular. His volunteer efforts will be missed. Mark Thomas successfully migrated the ooo bugzilla instance to ASF gear. Mark Thomas also improved our svn traffic banning schemes. Upgraded our relative state of paranoia following break-ins to kernel.org and linux.com. Lost a disk in hermes' (mail) zfs array, which was subsequently replaced with an existing spare in the rack. We need to look into purchasing another spare of the same specifications for future disk failures, as there are none left for us to use in the rack.
Set up and racked all the gear for the backup solution at TRACI. Philip flew down to assist, and we set up a safe deposit box to store tapes offline. Harmonia (svn.eu) lost a root disk and reported errors with its zfs array. The errors were subsequently cleared, but we need to look into a replacement root disk for the one we lost. After more complaints about delays in the account creation process, Sam Ruby created a script to automate the input processing for new requests. Together with more root people participating in the process, this has significantly cut response times from several days to just a few. Upayavira continues to work on the selfserve infrastructure, which will someday completely replace the existing account creation procedure. Uli Stärk is finalizing the specs for the EU replacement host for aurora (www.eu). Setup of the Openoffice MediaWiki install as well as the Openoffice forum site has made significant progress. Terry Ellison has led the effort for both. Andrew Bayer approached us with an offer to provide either a cash donation or a hardware donation for additional build slaves. Serge explained the situation with targeted donations and our past experiences with hardware donations at the ASF. Daniel Shahaf upgraded our svn installation to 1.6.17 on eris, thor, and harmonia in order to prevent further loss of commit emails. Cleaned up the root@ alias, adding the President Jim Jagielski to it. Doubled the available RAM on sigyn (jails) in the hope of improving the stability of the host. Bumped the secretary's LDAP karma to a level at least on par with a PMC Chair's. Niklas Gustavsson was granted infra karma for his work on our Jenkins build infrastructure.
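A script like the one Sam wrote boils down to normalizing each request into fields the creation tooling can consume; the field layout below is a purely hypothetical sketch (the actual request format and script are not reproduced here):

```python
def parse_request(line):
    """Split a semicolon-delimited account request into named fields.
    The field layout is a hypothetical illustration, not the actual
    ASF account-request format."""
    fields = ("userid", "fullname", "email", "pmc")
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))

req = parse_request("jdoe; Jane Doe; jdoe@example.org; httpd")
print(req["userid"], req["pmc"])
```

Even a validator this small removes the hand-massaging of malformed requests that made manual processing slow.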
LDAP + TLS + LPK work is underway. Justin Erenkrantz stepped down from root@. Ordered a 1y warranty extension for selene and phoebe (Geronimo TCK builds). Organizing an infra meetup to coincide with the Surge Conference in Maryland next month (http://omniti.com/surge/2011/). Got listed in SORBS again. Subsequently filled out the delistment forms, so all is well again. Brought modules.apache.org in-house to better deal with the spate of XSS vulns. Started coordinating with existing openoffice.org sysadmins to plan for eventual service migration. Part of the backup system order has arrived in Florida. The remainder is due to arrive by the end of the week. Arranged for travel from VA to FLL for Philip to help out with the racking and safe deposit box for the backup system. Work on upgrading id.apache.org to handle acct creation is progressing. Uli Stärk specced a replacement host for aurora (eu websites).
AI: Sam to follow up on Philip's missing credit card
Upgraded MoinMoin wiki to 1.8.8. Mark Thomas upgraded Jira to 4.3.4. Niklas upgraded Jenkins to 1.413 and moved it to a more stable URL (removing hudson from the URL). Took another whack at cleaning up /dist dirs. Current status is appended to the end of the report. VP visited OSUOSL to survey the racks. Looking to purchase additional RAM for sigyn (which keeps panicking because of lack of RAM). More XSS vulns reported against modules.apache.org, which we don't physically host. Ran into a snag while placing the order for the replacement backup system. At this point we need the treasurer's sign-off on a Bank & Trades agreement with Intervision, who will subsequently provide us with Net-30 terms for the order. OSUOSL has notified us they are running low on cooling capacity, which may affect our ability to host new machines at their datacenter. OSUOSL wants us to purchase a switch to rationalize their cabling system for our 3rd rack. We plan to accommodate them once the backup system has been purchased. We are purchasing a warranty extension contract from Dell for selene and phoebe, our Geronimo TCK build hosts in Traci.

-----
Status of PMCs/podlings that had not completed /dist clean-up by third reminder

Ignored all three e-mails to the PMC:
- TLPs: buildr, cassandra, click, couchdb, logging, spamassassin, synapse, tcl, tiles
- Incubator (Graduated): buildr, river
- Incubator: deltacloud, manifoldcf, olio, vcl, whirr

In-progress:
- chemistry - partially complete, dotcmis 0.1 needs to be removed
- empire-db - partially complete, 2.0.7 needs to be removed
- ws - partially complete, axis-c 1.5.0 needs to be removed
- santuario - argued, did nothing
- openwebbeans - argued, did nothing

Fixed and confirmed on third reminder: abdera, bval, cocoon, esme, hbase, hive, libcloud, nutch, oodt, perl, portals, qpid, roller, shiro, thrift, uima, wink, xmlbeans, xmlgraphics

Should not have been on third reminder list: jackrabbit
Migrated people.apache.org (aka minotaur) to HP gear. Discussed changing our Dell rep to a reseller better suited for our business. Received the Technologent invoice via OSUOSL. Started the initial steps of transferring ownership of jini.net to the ASF. Tony Stevenson was brought on board as our third contracted (part-time) sysadmin. Set up ACLs to allow PMC members to browse the archives of their own private lists. Set up infrastructure for storing PGP fingerprints in LDAP. Discussed the addition of a benchmark-running host for projects to use, particularly lucene. Continued pursuing the offenders on the XSS list reported to us by security@ last month. Scheduled an infra meetup to happen at the Surge conference at the end of September. Investigating some uptime problems with sigyn (tlp zones). Received a pair of D53J JBODs from Silicon Mechanics, with a partial refund (directed to treasurer@) for delays and lower-performance drives. Addressing management of dist/ trees as requested by Greg Stein.
It was observed that some account creation requests were not done for over a month. Prior experience was that accounts were processed weekly.
Should not happen again.
Cancelled our incorrect order for warranty support with Dell. The order was subsequently replaced with a lower-cost version ($1000), covering the same equipment for a 1y term. Sent a purchase order to Technologent for 1y warranty support on thor (confluence) and gaea (zones). Purchased and received a Dell r210 for $1700 for managing our VMWare VSphere installation. Corrected our contact information with Dell for the umpteenth time. Had OSUOSL deal with the mis-shipment from HP. Correct replacement parts are expected to arrive soon. Provided a little mod_rewrite magic to simplify our main webserver config. Upgraded Confluence in advance of a public security notice. Purchased a pair of Intel NICs: 1 to go into the HP gear and 1 for emergencies. Tweaked the network config on eris (svn) and eos (websites) to alleviate chronic downtime problems. Made an aborted attempt to upgrade minotaur (people) to new gear. Will be repeating that process shortly, this time using our HP donation as the target host. Mohammad Nour was granted karma to help Paul Davis sort out git hosting. Received a comprehensive XSS vuln report for several services we offer. Still working through the list. Offered Ryan Pan infra-interest karma for his penchant for running vmstat on minotaur (people). Renewed our service warranty with Dell for 2y covering hermes (mail) and eris (svn). Price: $1200. Sorted out the confusion regarding the iLO license for the HP gear. pear.apache.org was requested by a couple of PHP projects; it is now live and ready for projects to put PEAR releases on. Citing a lack of time, Philip, the current VP, asked Jim, the president, to appoint a successor. It was unanimously agreed that, at this time, this is not in the best interest of the foundation or the infrastructure team. Instead, some of this role's responsibilities, which are too much for a volunteer over the long run, are going to be doled out to a new part-time paid contractor and the current full-time staff.
It's hoped this will reduce the workload of the role to ~5 hrs/week, which is reasonable for an active volunteer. We will re-evaluate this once the new position is up and running. Philip is able to continue with the VP position for now at this reduced workload.
Upgraded ALL of our FreeBSD hosts to 8.2-RELEASE except for minotaur, which is slated for replacement later this week. Paul Davis was granted infra karma for his work on git hosting. Reined in the use of Jira accounts associated with mailing lists, predominantly for security reasons. Submitted a budget to the budget committee for 2011. The long-awaited gear from HP has arrived at OSUOSL. We are in the process of having it racked and brought online in our new 3rd rack. (Unfortunately some of the drives are incompatible with the hosts, so we are awaiting additional gear from HP to resolve this.) Migrated user .forward files into LDAP. LDAP is now authoritative for forwarding addresses. Rationalized much of our vhost config for the TLP websites using mod_vhost_alias. Initiated a general cleanup request to TLPs with large /dist/ directories. OSUOSL has upgraded our bandwidth cap to 50mbps inbound / 100mbps outbound (up from 10 and 50, respectively). Our expected (and paid for) arrays from Silicon Mechanics are delayed 3 weeks pending arrival of the requested Hitachi drives. Upgraded our svn servers to deal with an announced DoS vulnerability. Spoke with one of our Dell reps about pricing inconsistencies in our new service contract. It should be resolved in the near future (in our favor). Arranged for an updated quote from Technologent for service on our remaining Sun gear.
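For reference, mod_vhost_alias collapses per-TLP vhost stanzas into one block by interpolating the hostname into the document root; a minimal sketch, assuming hypothetical paths (this is not our actual config):

```apache
# Mass virtual hosting sketch: one stanza serves every TLP site.
<VirtualHost *:80>
    ServerAlias *.apache.org
    UseCanonicalName Off
    # %1 is the first dot-separated component of the Host header,
    # so foo.apache.org is served from /var/www/tlp/foo/content
    VirtualDocumentRoot /var/www/tlp/%1/content
</VirtualHost>
```

The win is that creating a new TLP site becomes a directory creation rather than a config change plus restart.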
Renewed service contract with Dell for 1 year regarding baldr and our 2 PowerVault arrays. Received a VAT invoice for ~ 900 EU for the array we recently shipped to SARA. Awaiting a wire from the treasurer for payment. Specced a pair of D53J JBODs from Silicon Mechanics. Awaiting a wire from the treasurer for payment. Started discussing next year's budget. We now have a 3rd rack to use courtesy of OSUOSL. Reworked the asf-do.pl script to overcome issues with opie. Shut down the portals zone for security issues. Deployed ckl- "CloudKick logging tool" to all our FreeBSD hosts. See http://www.apache.org/dev/ckl.html for details. NERO, our network provider at OSUOSL, discovered an open HTTP proxy on one of our hosts. Upon further investigation we closed several other outstanding security issues and removed root access from those responsible for the poor setup. Paul Davis continued his work on git hosting, setting up a mailer for testing. Daniel Shahaf was promoted to root@ for his outstanding work in several areas. Upgraded Jira to 4.2.x and added GreenHopper (supports agile development) for all projects. We're also eating our own (fresher) dog food since Jira now runs on the latest Tomcat 7 release.
Fixed all outstanding issues with the backup system. Brought erebus online (one of the Dells purchased last month) to serve as our main VSphere host. Set up a test instance of JIRA 4 in preparation for the 3.x to 4.x upgrade. Instituted a password policy which locks accounts for 24 hours after 10 failed login attempts. See https://blogs.apache.org/infra/entry/ldap_and_password_policy for details. Brought the CMS to a feature-complete 1.x state. It is now ready for wide-scale adoption, starting with the incubator; see https://blogs.apache.org/infra/entry/the_asf_cms for details. Updated our account details with Dell. Dealt with an extended people.apache.org outage during the New Year holiday. Dealt with some wiki abuse reports from NERO regarding attachments. As a result we have disabled the feature across the wiki farm. Updated the LDAP scripts on people.apache.org to filter out redundant entries in all "modify" operations. RMA'd a failed drive back to Silicon Mechanics. Promoted Daniel Shahaf to root karma on minotaur (people). Daniel Shahaf set up our reverse IP zone master for our OSUOSL IPs, with OSUOSL's dns server slaving off that. Specced a new JBOD array for service at about $7K. Brought id.apache.org online (props to Ian Boston, Daniel Shahaf, and Tony Stevenson); see https://blogs.apache.org/infra/entry/https_id_apache_org_new for details. Confluence upgraded to the latest 3.x version, courtesy of Gavin McDonald.
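A lockout policy of this shape maps naturally onto the OpenLDAP ppolicy overlay; a minimal sketch with a hypothetical DN (our actual policy entry is not shown here):

```ldif
# Hypothetical ppolicy entry: lock for 24h (86400s) after 10 failed binds.
dn: cn=default,ou=policies,dc=apache,dc=org
objectClass: person
objectClass: pwdPolicy
cn: default
sn: default
pwdAttribute: userPassword
pwdLockout: TRUE
pwdMaxFailure: 10
pwdLockoutDuration: 86400
```

Enforcing the limit in the directory itself means every LDAP-backed service inherits the lockout without per-service changes.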
Ordered a pair of Dell r510's for $24.2K, slated to be used for jails and database hosting. Discussed JVM settings on the ofbiz vmware instance with the ofbiz PMC. Discussed the previously offered HP donation of a pair of servers: one for www.eu and one for jails hosting. Agreed to accept the donation but will not seek to accept hardware donations in the future. Had Y! replace a pair of failed disks in minerva (builds). As it's a RAID 5 array, we had to reinstall the OS. Patched our JIRA installation in response to the recent security advisory from Atlassian. Gavin McDonald did an analysis of our backups to date, pointing out specific backups that are not going through cleanly. Bruce Snyder successfully arranged for a VSphere license from VMware. Discussed the fact that Justin Erenkrantz has moved away from the UCI campus where our backup server is hosted. The issue needs further attention going forward, as we need someone local who can change out the tapes and put the used ones into a safe deposit box. Paul Davis made significant progress in our pursuit of read-write git hosting.
It was noted that several projects gave thanks to infra for their work this period. Well done!
We specced and ordered an Opteron-based Dell R515 machine for virtual machine hosting- cost ~ $12.5K. A FC card was thrown in (for reuse of the decommissioned helios array) at no additional cost. Chris Rhodes was given root@ karma. We held a team meeting on Monday Nov 1 during ApacheCon. Highlights were posted to infrastructure-private@. Daniel Shahaf was made a member of the Infrastructure Team. www.apache.org was converted to the CMS with new templates and stylesheets provided by Chris J. Davis. HP has offered to supply our replacement servers for SARA. Tony Stevenson is leading this conversation, as it could save us roughly $30K from our budget. Discussed the fact that Apple has EOL'd Xserves while we just recently racked a donated pair of them. Renewed the service warranty with Dell on baldr (jails) for 1y. Drafted a plan on what it would take for Git to be used as a project's primary source code repository. Currently dependent on volunteers driving the effort.
Shane: yay new web site!
Helios (zones) has been decommissioned, replaced by FreeBSD jails on sigyn. 5 new TLPs were processed: pig, hive, shiro, juddi, karaf. We still need to purchase an Opteron-based box per the sponsorship agreement with AMD. Tony Stevenson fixed our busted zfs-snapshot-to-tape script on bia (backups). Upgraded all our Linux hosts to deal with the announced local root exploit bug. Bruce Snyder continues to pursue a VMWare VSphere installation for us. Coding work on the CMS is complete; for details see http://www.staging.apache.org/dev/cms.html We are planning to go "live" during Apachecon once the new templates and stylesheets are available for www.apache.org. We received a DMCA takedown notice regarding some content in the logging and Hadoop wikis. The PMCs have been notified but have not reported back on their progress. Specced an FC card to allow us to reuse the helios array, providing sigyn with more storage space. Estimated cost ~ $1000.
The AMD box has not yet been purchased.
AI Jim: facilitate the purchase of the box.
Gavin McDonald is in the final phase of migrating all necessary Solaris zones from helios to FreeBSD jails. Hudson master has moved to a new machine (aegis) and begun using LDAP, thanks to Tony Stevenson and Niklas Gustavsson. Jukka Zitting notes it may be time to start experimenting with running Hudson slaves on EC2. Sander Temme is preparing eve (Xserve) for hudson and buildbot usage. Daniel Shahaf patched the downloads script to deal with a potential XSS vulnerability. Unfortunately some stray code wound up in production due to anakia deployment issues, which took the downloads script down for several hours. Sander Striker signed a Dell "letter of liability" for Dutch equipment purchases for SARA. Stefan Bodewig was given full infra karma due to his vmgump contributions. Ulrich Staerk was given full infra karma based on his work on s.apache.org, jira, and confluence. We received 2 disks from Silicon Mechanics and replaced a failed disk in eos (www). The replaced disk will be shipped back to Silicon Mechanics under RMA. We ordered a disk array from Silicon Mechanics to be drop-shipped to Bart van der Schans in the Netherlands for eventual deployment at SARA. Cost was ~$5700. Started up a project for creating a custom CMS for Apache. Initially it will target www.apache.org, with something for people to review around ApacheCon in November. Gavin McDonald proposed some new equipment purchases to build out our VM infrastructure. Upayavira completed the domain transfer for ofbiz.org. Ari Maniatis is pursuing hosting an svn mirror in Australia. Don Brown is pursuing the idea of getting support for the Confluence auto-export plugin. Go Don!
Daniel Shahaf, Dave Johnson, and Niklas Gustavsson were granted infra-interest karma. Norman Maurer brought the new athena (mx1.us) online. Sander Temme shipped the pair of Xserves in his possession to OSUOSL. They have been racked and are currently being configured for use. Ari Maniatis has been in touch with the University of Sydney for the purpose of both hosting an svn mirror and providing facilities for an Apache conference/barcamp. We looked into the idea of holding an infra-thon before ApacheCon in November but the timing didn't seem to work out. We will probably try again in early 2011. Began the process of speccing replacement hosts for our EU gear hosted in SARA. The replacement host for minotaur has arrived, been racked in OSUOSL, and is currently being set up by Philip Gollucci. Note that we are running out of current capacity (amps) in our racks. Gavin McDonald reached out to PMCs with zones on helios to tell them about our migration plans: replacing those zones with FreeBSD jails. Most have responded in a timely manner. Odin has been decommissioned and the vmware instances it hosted for vmbuild and vmgump have been transferred to nyx. One of our disks in the brand new eos array was defective. It's been taken out of service and is to be shipped back to Silicon Mechanics for replacement. We have also ordered a pair of spare drives from them as well. We ordered an array from Silicon Mechanics to be shipped to Bart van der Schans in the Netherlands for ~$6000. The array is to be part of our replacement plan for aurora (www.eu) in SARA. We've moved the Hudson master to a new machine (aegis) and begun using LDAP for Hudson, thanks to Tony Stevenson and others.
Shane asks when amps shortage will be critical. Philip replies that it's under control.
Approved by general consent.
Philip Gollucci performed a general cleanup of the filesystem on minotaur (people). OK'd Jukka's plan to set up Gerrit for hosting an Apache Lab. Infra was notified of a compromised gmail account potentially hacked as a result of the jira hack on Apache. Joe Schaefer was called for 2 weeks of jury duty and the team capably picked up the slack in his absence. Kudos in particular to Gavin McDonald, Philip Gollucci, and Paul Querna. The new replacement machine for eos is currently online and serving traffic for www.apache.org and mail-archives.apache.org. Migration of the moin wiki will follow soon. Thanks to Paul Querna and Philip Gollucci for doing the bulk of the setup. The new replacement machine for athena is set up and should be brought online as mx1.us shortly. Thanks to Norman Maurer and Philip Gollucci for doing the bulk of the setup. Received a "paid invoice" notice from Network Depot for the SonicWall. Held some discussions between Mark Thomas and Philip Gollucci about how to set up baldr, the machine destined to host issues.apache.org. Ordered a Dell R410 for ~$5000 to serve as a replacement for minotaur (people). Tony Stevenson made a few modifications to our LDAP tree to better service Hudson and similar apps.
Sander Temme has been busy configuring the pair of Apple-donated Xserves prior to shipment and installation at OSUOSL. Gavin McDonald did some repair work on our backup systems. Discussed the future of odin (vmware) while Mark Thomas replaced a failed disk in it. Mark Thomas was promoted to root@. Brought a URI-shortening service online: http://s.apache.org/details which takes advantage of LDAP + the Thawte-supplied wildcard cert. No progress was made in bringing up the purchased replacement host for eos. The new machine aegis was brought into service to replace ceres as the buildbot master; ceres continues as a buildbot slave. Work is underway to move the Hudson.zones master to aegis as well. Sebastian Bazley has taken up the charge to address a few key crons that require access to private svn urls.
Replaced an S300 software RAID card with a Perc 6i card in aegis (builds). Enforced a hard May 1, 2010 deadline for all admins to adopt OPIE on all Linux and FreeBSD hosts at Apache. Gavin McDonald upgraded Confluence to 3.2 - the latest available version. Dan Kulp was kind enough to patch the autoexport plugin this time round, but we need to take another serious look at phasing out Confluence as a CMS in the near future, perhaps replacing it with Day's CQ5. Initiated a periodic password cracking program for our most sensitive passwords, particularly LDAP passwords. Early results identified some 60 accounts vulnerable to dictionary-style attacks, and those users have been contacted. Also notable was the identification of FreeBSD crypt as a storage format for hashed passwords superior to SSHA, so we are in the process of phasing out SSHA for LDAP passwords. We are in the process of compiling a list of accounts with security issues that are no longer reachable via their apache.org email address. Those will be the first group of accounts we close out. We have cleaned up the root@ alias addresses and synced them with committee-info.txt. Notable changes were the removal of Roy Fielding, Ted Husted, Joshua Slive, and Erik Abele, and the addition of Gavin McDonald, Tony Stevenson, and Norman Maurer. Due to port restrictions and lack of console access to Y! machines, the buildbot master was moved to the newly brought online 'aegis' builds machine. The old host 'ceres' remains as a buildbot slave. Hudson master is due to move across from Hudson.zones shortly. Noirin Shirley was voted onto the Infrastructure Team for her editorial work on the infra blog.
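For context on the SSHA-to-crypt migration above: an LDAP {SSHA} value is a single, fast SHA-1 pass over password plus salt, which is cheap to attack with dictionary methods, whereas crypt-style formats iterate their hash many times. A minimal sketch of the {SSHA} format itself (function names are ours, not part of any LDAP tooling):

```python
import base64
import hashlib
import os


def ssha_hash(password: bytes, salt: bytes = b"") -> str:
    """Build an LDAP-style {SSHA} value: base64(SHA1(password + salt) + salt)."""
    salt = salt or os.urandom(4)
    digest = hashlib.sha1(password + salt).digest()
    return "{SSHA}" + base64.b64encode(digest + salt).decode("ascii")


def ssha_verify(password: bytes, stored: str) -> bool:
    """Check a candidate password against a stored {SSHA} value."""
    raw = base64.b64decode(stored[len("{SSHA}"):])
    digest, salt = raw[:20], raw[20:]  # SHA-1 digests are 20 bytes
    return hashlib.sha1(password + salt).digest() == digest
```

The single `hashlib.sha1` call above is exactly the weakness: one cheap hash per guess, versus the thousands of rounds a crypt-style scheme imposes.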
Norman Maurer upgraded several Solaris machines to the latest available version. Purchased 48GB of RAM from Crucial, slated for installation in the upcoming eos (websites) replacement host. Had OSUOSL ship Ken Coar the failed fireswamp (apachecon) host. Installed new Dell 48-port switches and moved all public DRACs at OSUOSL to the private network. Ordered a Dell R410 (slated to replace eos) with external SAS card for $3800. Ordered a 12-disk JBOD from Silicon Mechanics for $6400. Ordered a Dell R210 (slated to replace athena) for $2200. Ordered a SonicWall SRA 2000 for $2300 to replace our crappy VPN device. Took Paul Querna up on his offer of Cloudkick service for host/service stats. Ruediger Pluem was granted full infrastructure karma. Working out details of an Apple donation of a pair of Xserves. Brutus (issues, cwiki) got hacked. The details are available at https://blogs.apache.org/infra/entry/apache_org_04_09_2010
Philip Gollucci signed the annual service contract with Sun/Technologent for ~$2K. 2 SSDs were installed into eris (svn) to boost performance. Between our EU and US svn servers we currently handle over 6M hits/day. RAM ($1200) was installed into eris (svn) and brutus (jira, bugzilla) to boost performance. Website traffic to our TLPs and www.apache.org is hovering around 10M hits a day. Spam traffic continues to fall: we are currently seeing only about 600K connections per day, down from its peak of 1.5M connections a day in 2006. Philip Gollucci worked some magic and has upgraded all of our FreeBSD boxes to 8.0-stable. The old NGROUPS_MAX problem that previously limited users to 16 unix groups is a thing of the past. Discussed and created a budget for FY 2010 worth ~$250K. In discussions to purchase an Xserve from Apple for ~$6K. Discussed an offer from a third party to host a virtual machine for us. Ultimately the offer was declined. Discussed plans for migrating Solaris zones to FreeBSD jails. Aurora (websites) is down for an extended period of time until we can determine whether to replace it immediately or have the machine serviced by a Sun tech. Gavin McDonald specced another Dell for use as a build farm server for ~$6K. Purchased a pair of Dell 5448 48-port managed switches for ~$1600. Brad Davis of FreeBSD infrastructure subscribed to infra-private@. Aristedes Maniatis was granted infrastructure-interest karma.
Geir asked about the Technologent invoice; Phil to follow up.
SVN performance improvements are much appreciated!
Turns out the SAS card we ordered for eris (svn) is totally unsupported by FreeBSD. We have sent it back to Newegg and ordered a known-to-be-supported Dell card instead, with additional cables from Provantage, at a total cost of ~$350. After several months of trouble with the nyx (VMs) raid card / array, upgrading "everything" seems to have resolved the random disk failures. Jukka Zitting has installed gerrit on git.apache.org for preliminary (semi-authorized) testing of native server-side git support. Mass mailed hundreds of committers about insecure / oversized items in their home directory. It is not encouraging to see the same people on the lists month after month. Philip Gollucci negotiated new terms for the Sun support contract (to be signed later this month). Tony Stevenson and Chris Rhodes engineered the successful migration of Unix and svn group data into LDAP.
Board agrees with the approach of temporarily locking out committers who aren't paying attention to security notices.
De-racked 4 decommissioned machines: freyr, freyja, idunn, fireswamp. Introduced ckl, a communication tool for distributed sysadmin teams, into our workflow. Tony Stevenson acquired a 2-year wildcard SSL certificate from Thawte. All top-level project websites, including www.apache.org, are now available over https. Sketched out some preliminary plans for migrating the zones on helios to jails on sigyn. Replaced a failed disk in hermes (mail). Ordered a pair of SSDs and a JBOD enclosure to house them from Silicon Mechanics for ~$2400. The order is expected to ship before Feb 1 and will be used to beef up performance on eris (svn). Ordered a corresponding PCIe 2xSAS card for eris (svn) to communicate with the SSDs for ~$400. Dan Kulp was granted root on brutus to assist with confluence admin.
Sander Striker ordered a disk replacement for nike (mx1.eu) and Bart van der Schans installed it. Began testing a searchable interface for private mail archives courtesy of Chris Rhodes. Started sending out periodic notifications to users with oversized home directories, insecure permissions or private keys. Improved our DNS configuration with input from Surfnet. Ordered 6 SCSI drives to serve as replacements for various failed drives. Tony Stevenson specced an HP machine slated to replace the aging minotaur (people). Tony Stevenson expanded our LDAP footprint to now be usable for logins with all subversion repositories. In contact with Sun to replace a failed controller in helios's array. Philip Gollucci upgraded loki to FreeBSD 8.0. Philip Gollucci patched FreeBSD on minotaur(people) to deal with a few security advisories. Martin Cooper and Davanum Srinivas dealt with a mass influx of spam accounts into Confluence.
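The periodic notifications above about oversized home directories, insecure permissions, and private keys come down to a filesystem sweep. A minimal sketch of such a check (function names, key-file markers, and the size threshold are our own illustrations, not the actual script used):

```python
import os
import stat

# Filenames that suggest a private key was left in a home directory.
# (Illustrative list; a real audit would inspect file contents too.)
KEY_MARKERS = ("id_rsa", "id_dsa")


def audit_home(root: str, max_bytes: int = 1 << 30):
    """Walk one home directory and report (world-writable files,
    likely private keys, whether total size exceeds max_bytes)."""
    world_writable, key_files, total = [], [], 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            total += st.st_size
            if st.st_mode & stat.S_IWOTH:
                world_writable.append(path)
            if name in KEY_MARKERS:
                key_files.append(path)
    return world_writable, key_files, total > max_bytes
```

A notification job would run this per user and mail anyone whose report is non-empty.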
Philip Gollucci was appointed VP Infra, replacing Paul Querna. Brian Fox was added to the Infrastructure Team. A podling's distribution directory was compromised due to lax permissions (our hacker from August had installed a backdoor script in a user's public_html directory). The offending material was removed prior to being distributed to the mirrors. A general cleanup of home-dirs with lax permissions was also executed. One of our virtual hosts was hacked, most likely due to a poor choice of root passwords (and leaving remote-root-logins enabled). The virtual host was summarily nuked as a result. There was some confusion surrounding the creation of a DNS entry for the upcoming Asia Roadshow event. A general question was raised by Philip regarding how much Infra has spent of its budget so far. Advice from the Treasurer would be helpful. Sander Striker is pursuing a replacement for clarus. Infra held a face-to-face meeting at ApacheCon. Notes were posted to the infra-private list by Upayavira. Tony Stevenson has started tackling LDAP again. Chris Rhodes has been testing a service to provide web access to our private archives to members. The PDFBox TLP was created. Subversion's repo has been migrated to the ASF repo. Chris J. Davis began hacking up a new website for Apache. Gavin McDonald continued to extend our buildbot offerings. We seem to have unresolved issues with the tape library on bia (backups). We're in discussions with HP regarding a hardware donation/loan. We're in discussions with Thawte regarding SSL certificates. Two disks have failed in nyx's (virtual machines) array. Tony Stevenson has partially addressed the situation, but we need to purchase replacement drives ASAP. One disk has failed in eos's (websites, wiki) array. We're waiting for a replacement drive to be purchased. We're stalled on what to do about switch replacements. 
The current (24 port) switches are essentially full, and we need to purchase a switch or two with more ports to replace them.
Moin wiki post-upgrade issues resolved. Ruediger Pluem developed a patch for httpd to mitigate the ddos issue plaguing brutus (jira, bugzilla). Norman Maurer upgraded eos (wiki, websites) to Solaris 10 u8. Issues with the sync scripts were resolved by Tony Stevenson. Athena (mx1.us) was brought back online by Philip Gollucci after OSUOSL replaced the power supply. Philip Gollucci developed an automated system for managing crons. Paul Querna purchased a Dell PowerEdge R610 for ~$4400 slated as a replacement for helios (zones). Justin Erenkrantz dealt with some IO issues on helios by shuffling a few zones around. Philip Gollucci cleaned up a few unused root accounts. (Temporarily) suspended a committer's privileges pursuant to a board request. General interest was expressed in participating in a cross-foundation infrastructure list hosted by OSUOSL. Gavin McDonald continued his work on the buildbot-based farm. Gavin McDonald spec'ed a Dell machine to potentially be used as an Australian svn mirror. Philip Gollucci upgraded viewvc to 1.1.2. Paul Querna ordered a fiber channel card for ~$450 to complement the ordered Dell mentioned above.
o Hacking Incident
  = First major security incident on our infrastructure since 2001
    (<http://www.apache.org/info/20010519-hack.html>). There are always
    possible things to change, but we handled it well, and have rebounded
    with one of the most active months in recent Infra memory.
  = Initial Report: <https://blogs.apache.org/infra/entry/apache_org_downtime_initial_report>
  = Full Report: <https://blogs.apache.org/infra/entry/apache_org_downtime_report>
o SARA/SURFnet hardware moving to new location inside same data center.
  bvds heading up the local team on the ground.
o Added Gavin as a part time contractor.
o SvnPubSub developed
  <https://svn.apache.org/repos/infra/infrastructure/trunk/projects/svnpubsub/svnpubsub.py>
  = Notifies services of changes to the Subversion repositories
    - Twitter bot online <http://twitter.com/asfcommits>
    - Testing SvnWcSub, to keep a working copy in sync with a master
      repository. Will replace /dist/ and most website distribution in the
      long run, which is currently being done with rsync over SSH.
o mod_allowmethods developed
  <https://svn.apache.org/repos/infra/infrastructure/trunk/projects/mod_allowmethods/mod_allowmethods.c>
  = Disabled all non-GET requests on most vhosts for *.apache.org
o mod_asf_mirrorcgi developed
  <https://svn.apache.org/repos/infra/infrastructure/trunk/projects/mod_asf_mirrorcgi/mod_asf_mirrorcgi.c>
  = Hack to map our hundreds of identical download.cgi scripts to invoke
    the same CGI directly.
o Disabled CGI support on most vhosts for *.apache.org
o MoinMoin wiki upgraded from 1.3.x to 1.8.x
o FastCGI via mod_fcgid is now used for the wiki and mirrors.cgi
o nyx set up with VMware to host various VMs.
o Enabled OPIE on brutus & nyx.
o Dealt with ZFS issues on minotaur (people). Had to rebuild the array
  after a disk died.
o Replaced hyperreal.org with no-ip.com for one of our slave DNS servers.
o Coordinated a security fix for Bugzilla.
  We were contacted ahead of time by Bugzilla developers, and given a patch
  to apply before they made a public disclosure. Mark has since updated us
  to their new release.
o Requested new Solaris 10 OS subscription keys from contact at Sun.
o thor brought online (to host svn-dist and search services)
o eos, bia, thor, and aurora upgraded to Solaris 10 u7.
o New sshd_config using SSH keys stored in SVN for infra members has been
  deployed on most machines.
o In progress of removing unneeded access & sudo on several machines
  (hermes, brutus, minotaur)
o Promoted norman, pctony, gmcdonald to root@ on all FreeBSD boxes
o ZFS is declared production ready in 8.0-RELEASE when it comes out
o minotaur (people)
  = Updated to 7-stable
  = Updated ports:
    - hpn-ssh is now a port
  = Updated people.apache.org, www.apache.org httpd 2.2.11 -> 2.2.13
  = Converted to dns/bind96
  = Set up no-ip as DNS slaves
  = Started ipfw -> pf conversion
o hermes (smtp)
  = Updated to 7-stable
o hercules (mx2.us)
  = Updated to 7-stable
  = Set up
o eris (svn.us)
  = Updated to 7-stable
  = Updated ports
    - serf is now a port
  = Updated svn 1.6.1 -> 1.6.5
  = Updated httpd 2.2.11 -> 2.2.13
  = Updated svnmailer
  = Attempted viewvc update
    - fixed viewvc file contents bug
o harmonia (svn.eu)
  = Updated to 7-stable
  = Updated ports
    - serf is now a port
  = Updated svn 1.6.1 -> 1.6.5
  = Updated httpd 2.2.11 -> 2.2.13
  = Updated svnmailer
  = Attempted viewvc update
    - fixed viewvc file contents bug
o athena (mx1.us)
  = Updated to 7-stable
  = Replaced dead disk ad4 [osuosl]
  = Replaced DOA disk again [osuosl]
  = Updated httpd 2.2.11 -> 2.2.13
  = Updated ports
o nike (mx1.eu)
  = Updated to 7-stable
  = Updated httpd 2.2.11 -> 2.2.13
  = Updated ports
  = Updated lom [osuosl]
o loki (tb, ftp, cold spare)
  = Updated to 7-stable
  = Updated ports
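SvnPubSub's core design, as described in the report above, is that services subscribe to a stream of commit notifications instead of each polling Subversion themselves. That idea can be sketched as a tiny publish/subscribe hub (class and method names here are illustrative, not SvnPubSub's actual API):

```python
from typing import Callable


class CommitHub:
    """Minimal publish/subscribe hub in the spirit of SvnPubSub: the
    repository side publishes each new revision once, and any number of
    services (a Twitter bot, SvnWcSub-style working-copy syncers, ...)
    react to it without polling the repository themselves."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[str, int], None]] = []

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        """Register a service to be told about every new commit."""
        self._subscribers.append(callback)

    def publish(self, repo: str, revision: int) -> None:
        """Announce one new revision to every subscribed service."""
        for callback in self._subscribers:
            callback(repo, revision)
```

The real SvnPubSub streams notifications over HTTP so subscribers can live on other hosts, but the fan-out shape is the same.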
Justin Mason arranged for a free license to the Spamhaus DNSBL service. Tony Stevenson and Chris Rhodes began testing phase 2 of LDAP service on harmonia (svn mirror). We lost another drive in aurora (.eu website mirror). Sander Striker is investigating a pair of replacement drives. Gavin McDonald was voted into the root@ club for his interest in FreeBSD maintenance. Purchased an R710 Dell box for VMware hosting for ~$5K. Purchased RAM, a SCSI card, and additional drives from NewEgg for ~$1100. Mads Toftum upgraded aurora (.eu website mirror) to Solaris 10 5/09. Don Brown upgraded our Confluence installation to 2.10.3. Gavin McDonald is organizing the return of the four IBM x345s on loan. We continue to experience availability issues with eris (svn) due to zfs issues with FreeBSD. VMware Workstation on odin was upgraded to 6.5.
Tony Stevenson completed phase 1 of the LDAP migration, migrating user accounts on people.apache.org into LDAP. Sander Striker promised to someday order a replacement disk for aurora (websites) and have it shipped to Bart van der Schans in the Netherlands. The SAS cable we RMA'd back to Provantage was returned to us as an invalid RMA. We have procured a UPS shipping label from Provantage and are attempting to resend it. Infrastructure has made a request to PMC chairs to help us with phase 2 of the LDAP migration: bringing groups into LDAP. The majority have complied, but a large number of PMCs have yet to do so. IPv6 support was disabled until we are better positioned to monitor and maintain it. Henk Penning continued to keep a careful eye on the mirroring system. Brian Fox continued his support for the Nexus installation at repository.apache.org. Mark Thomas upgraded our Bugzilla instances to the latest version. Chris Rhodes was voted in as a new Infrastructure committer. Gavin McDonald continued to enhance our buildbot service at ci.apache.org.
Shipped the errant SAS cable back to Provantage. 16 new backup tapes ordered, charged to the ASF credit card. Began organizing volunteers for moving our gear in SARA. We still need to purchase a disk replacement for aurora (websites). Philip Gollucci upgraded all of our FreeBSD servers to 7.2-RELEASE, using a central machine, tb.apache.org, for compiling and pushing out the software. Paul Querna upgraded subversion to 1.6, splitting out the remaining private portions of the "asf" repository into a new "infra" repository. Note to officers: the new location containing the asf-authorization file is https://svn.apache.org/repos/infra/infrastructure/trunk/subversion/authorization . Sent an opt-out letter for Phorm scanning on apache.org and related domains. Discussed the lack of progress with respect to upgrading the Confluence auto-export plugin to be compatible with recent releases of Confluence. Adaptavist again claims to be nearly finished with the work (ETA 2 months), but if the situation hasn't been resolved in that timeframe we will need to pursue other options, including migrating all of cwiki.apache.org to the latest version of MoinMoin. Updated the Release FAQ, with input from several contributors.
Submitted a budget to the budget committee. Purchased a 1-year extension to the Technologent service contract for our Sun gear at $4K. Set up a blogging infrastructure for projects to use at blogs.apache.org. Henri Yandell merged the Click, Cayenne, and Roller jiras into the main jira. Our Dotster account was hacked (again), which briefly changed our DNS glue records for apache.org. Discussions to change our registrar are ongoing. Migrated our core mail server (hermes) from our last IBM x345 in service to one of our new Dell 2950s. The old gear will be deracked and sent back to IBM along with the other x345s. Set up Geographic DNS for svn.apache.org to distribute traffic between our master server (eris) and our European mirror (harmonia). Purchased a Linksys RV042 VPN device for $138. Work on the new git.apache.org continues, principally being performed by Jukka Zitting and Grzegorz Kossakowski. Work on LDAP at the ASF continues, being driven by Tony Stevenson. Work on the Buildbot installation continues, being driven by Gavin McDonald. We've set up a new mailing list for build services at email@example.com. Norman Maurer upgraded our backup server (bia) to Solaris 10u6. Lost a disk in both aurora (websites) and minotaur (people).
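The Geographic DNS setup mentioned above amounts to answering the same name (svn.apache.org) with a different address depending on where the query originates. Conceptually (the region table and lookup function below are our own simplification, not the real GeoDNS configuration):

```python
# Region -> host table. eris and harmonia are the real svn hosts named in
# the report; mapping clients to a two-letter region is the part a GeoDNS
# service does for you, simplified away here.
SVN_HOSTS = {
    "NA": "eris.apache.org",      # master server, US
    "EU": "harmonia.apache.org",  # European mirror
}


def resolve_svn(client_region: str) -> str:
    """Answer svn.apache.org with the closest server, falling back to
    the master for regions without a dedicated mirror."""
    return SVN_HOSTS.get(client_region, SVN_HOSTS["NA"])
```

European clients thus hit the mirror while everyone else falls through to the master, spreading read load without any client-side configuration.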
Replaced a failed disk in eris (svn). Discussed what to do about disused accounts on people.apache.org. No actions taken at this time. Replaced a failed disk in eos (websites). Norman Maurer rebuilt the array to be based on raidz2 instead of raidz1, and upgraded the operating system to Solaris 10u6 + patches. Gavin McDonald convinced Yahoo! to open up the buildbot server port on ceres (builds). Discussed a budget - the general consensus is that the infra budget should be $150K-$160K per year, but folks on the budget committee are looking for a more detailed breakdown of the hardware portion. Discussed an infra meetup between various open source organizations, with the intent to fund travel should it come to fruition. Norman Maurer convinced the SpamAssassin PMC to move their IO-intensive jobs to their zone on odyne. Shipped the errant SAS cable back to provantage.com. git.zones.apache.org was set up on odyne for Jukka Zitting to work on. Coordinated the move of planetapache.org to planet.apache.org/committers, with the help of several member volunteers. Henri Yandell carefully upgraded all of our Jira installations to 3.13.2. Sebastian Bazley continued his work validating foundation records. Henk Penning continued his work handling mirror requests.
Justin confirms that the $150-160K figure includes staffing.
Philip Gollucci successfully upgraded people.apache.org to FreeBSD 7.1. Paul Querna purchased cables, SAS card, and an array for replacing the existing array on people.apache.org, for ~ $4500. Unfortunately the cable provider sent us the wrong cable, so we had to order a replacement. Paul Querna and Sander Striker coordinated with our datacenter providers to enable IPv6 routing for our websites. Tony Stevenson set up backup services for the Apachecon hosts. Paul Querna restructured the apache.org DNS zone to be generated from a script. Gavin McDonald continues to work on a buildbot installation on 2 of our new Yahoo! hosts. Gavin McDonald pushes for a budget for infrastructure by the end of February. Brian Fox set up a Nexus instance on the repository.zones.apache.org zone to facilitate moving some of our maven infrastructure to repository.apache.org. Paul Querna renewed the myfaces.com DNS for 2 years. Confluence is still mired in the futility of hoping someone will fix the autoexport plugin, despite many promises from Atlassian. Mads Toftum and Norman Maurer are discussing how best to provision thor for the services we will place on it. Eos (websites) lost a ZFS disk which took it out of commission for a few days. Services were moved to aurora to prevent a significant outage. Infrabot was significantly enhanced, adding many new collaboration features. It is worth noting that infrabot is now on twitter at http://twitter.com/infrabot , which we expect to use for service announcements and outreach to folks too busy to follow the infrastructure mailing lists.
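Generating the apache.org DNS zone from a script, as mentioned above, typically means rendering BIND-style records from structured data rather than hand-editing the zone file. A minimal sketch of that approach (the function name and the record data shown are illustrative, not the real apache.org zone or script):

```python
def render_zone(origin: str, ttl: int, records: list[tuple[str, str, str]]) -> str:
    """Render a BIND-style zone file fragment from structured data.
    `records` holds (name, rrtype, value) tuples."""
    lines = [f"$ORIGIN {origin}.", f"$TTL {ttl}"]
    for name, rrtype, value in records:
        lines.append(f"{name}\tIN\t{rrtype}\t{value}")
    return "\n".join(lines) + "\n"
```

Keeping the host data in one structure and regenerating the file on change avoids the copy/paste errors that accumulate in a large hand-maintained zone.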
Discussions about what to do with thor now that the new disks are installed continue. The new Yahoo! machines are online and being configured by Nigel Daley and Gavin McDonald with build services. Henri Yandell continues to wrestle with a Jira upgrade. Purchased a new array, cables, and SAS card for minotaur (aka people.apache.org) for $4700. Brought loki (hot-spare) online with FreeBSD 7.1. We're planning to do the same for the new hermes (mail) next month. Tony Stevenson continues to lead the LDAP deployment discussions on infrastructure-dev@. Two TLP migrations, buildr and camel, were completed. Roughly 2 dozen new accounts were created. Philip Gollucci is planning to upgrade minotaur (people.apache.org) to FreeBSD 7.1 in preparation for the arrival of the aforementioned array.
Paul clarified that the original intent for Thor was as a build server, but the new build machines from Y! turned out to be a better match due to bandwidth constraints and location.
6 SAS disks were purchased for use in thor: cost $2000. Discussions for repurposing thor as a database/blog server once the disks are installed are ongoing. 4 Yahoo! build-farm machines have been made available to the infrastructure team for configuration. Discussions for requesting another hired sysadmin on a part-time basis to provide centralized build services are underway. Henri Yandell continues to wrestle with a Jira upgrade. Bart van der Schans visited the SARA colo again to help with maintenance on harmonia (svn mirror) and nike (mail). Paul Querna continues work on the new www.apache.org/dev/stats pages. Sander Striker and Paul Querna are pursuing an IPv6 allocation for apache hosts. Adaptavist has offered to continue James Dumay's work on the autoexport plugin (which still prevents us from upgrading Confluence). Atlassian has offered to host an SVN mirror for providing more Fisheye sites. Renewed the spamassassin.org domain for an additional 2 years. Philip Gollucci somehow figured out how to upgrade our two frontline mailservers, nike and athena, to FreeBSD-7-STABLE. Renewed the SSL certificates for svn.apache.org and issues.apache.org. Upgraded all ~800 apache mailing lists to support the emerging SRS and BATV specifications. Tony Stevenson continues to pursue LDAP deployment at the ASF. Chris J. Davis was brought on as a new infrastructure committer. Jukka Zitting was added to the infrastructure team for his work on git mirrors. The Attic PMC infrastructure was set up. Three TLP migrations - abdera, qpid, and couchdb - were completed.
We discussed the importance of establishing a budget before we can evaluate the request for a sysadmin.
Yahoo! talks about a build farm are proceeding at our normal slow and steady pace. T-shirts were distributed to team members. Held a face-to-face meeting at Apachecon with irc logging. Henk Penning and Gavin McDonald led the effort to clean out the mirror system, successfully purging over 6GB of redundant artifacts. Made progress with bringing Confluence up-to-date, by first recognizing its importance to current operations and then working with James Dumay on irc to bring the autoexporter plugin up to date. We expect to make more progress once James' patch has been tested and incorporated into the autoexporter tree. No progress has been made in replacing hermes (mail). We're looking forward to tackling that when FreeBSD releases 7.1, with expected improvements in its zfs support. Norman Maurer is still waiting for us to order disk replacements for thor (builds). Took a few failed cracks at upgrading nike (mail) to FreeBSD 7 with Bart van der Schans' on-site help. Will continue tackling this problem before pursuing FreeBSD upgrades to other x2200's in service. Offered Chris Davis committership under infrastructure's purview. He has accepted the offer. Tony Stevenson-led investigations into deploying LDAP at the ASF have gained some speed. We've talked with the Apache Directory project a bit about potentially using their software for this purpose. Paul Querna provided the Syracuse researchers with a filtered dump of our public svn tree, and will be coordinating with them to determine the best way of keeping their soon-to-be-published copy of our repo current. We also pointed them at the rsync location of our raw mail archives. Justin Erenkrantz purchased a few new SSL certs to replace those that were scheduled to expire next month.
Paul notes that we may need another part-time administrator next year. He doesn't feel comfortable tasking that to our existing administrators, as this requires root.
The infrastructure team is looking into Covalent's tools as time permits.
Yahoo! talks about a build farm are ongoing. No progress has been made in replacing hermes (mail). We're looking forward to tackling that when FreeBSD releases 7.1, with expected improvements in its zfs support. Fail2Ban, a tool for guarding against ssh scans, has been implemented on people.apache.org by Tony Stevenson. Some issues with Confluence, particularly its license, have arisen. We are currently stuck between a rock and a hard place in that we cannot upgrade it without breaking the autoexport plugin, which is a core feature of the service. Disk replacements for thor (builds) have not been ordered, pending a callback from CDW. Mark Thomas was granted infra karma for his work on Bugzilla. Discussions about creating a vmware instance for Windows are ongoing. Our svn servers are now servicing over 2M requests per day, which is a doubling of activity over the past year. Our automated checks against wiki spam have blocked roughly 100K attempts over the past year. Jukka Zitting continues his impressive work experimenting with git at Apache on infrastructure-dev@.
We've fallen a bit behind our machine upgrade schedule, due mainly to persistent concerns over the stability of FreeBSD on eris (svn). The machines that need to be brought online are loki (a cold spare) and hermes (a drop-in replacement for the existing x345 which serves mail). Dealing with intermittent problems with the build process in the Hudson zone. Transferred the vmsa VMware instance to a zone on odyne for performance reasons. Created roughly two dozen new committer accounts. Sebastian Bazley continues his work rationalizing foundation records and authoring supporting scripts. An issue came up regarding the ability, or lack thereof, to purge sensitive data from the svn repository; no action was taken at this time. Yahoo! has been in touch with us to resume talks about a build farm donation.
Paul continues to work the issue on the incompatible drive trays.
We have a functional backup system in place, complete with backups of select files within user home dirs, thanks to Tony Stevenson, Gavin McDonald, Norman Maurer, and Roy T. Fielding. Philip Gollucci was flown down to Fort Lauderdale to help set up the new colo site at TRACI.net. The two Geronimo build machines, selene and phoebe, were set up and handed off to the Geronimo PMC. An experimental LDAP zone was set up to pursue the idea of deploying LDAP in some capacity at the ASF. Purchased a SCSI card for thor (build zones); unfortunately, the existing array failed miserably (both PSUs died). We are currently pursuing a different path of installing drives in thor (said drives were also purchased from Sun for ~$2000, but need to be replaced due to incompatible drive trays). Wendy Smoak was granted infrastructure karma for her work on the ASF's Maven repository. The Maven snapshot repository was purged of all files older than 30 days, which created some ripples within the community. At its largest it was over 90GB, which means it contained more bits than archive.apache.org; it currently stands at 21GB.
Henning will work with Geronimo folks to start load monitoring on the new machines provided to their projects.
Purchased two Dell 1950s for $10K. The machines have been shipped to the sysadmin and will be deployed shortly for Geronimo to use for TCK testing. We investigated EC2 as an alternative, but found it wasn't cost-effective for the needs of the Geronimo PMC. Entered into a monthly agreement with TRACI.net for ~$500/month for colocation services in Florida. Ordered a switch and miscellaneous cables for setting up the colo. Helped some new zone admins set up shop. Set up the Sun T5220 as thor, which will be used for build systems once we have the disk situation sorted out. Work continues on setting up bia (backups), driven by Tony Stevenson, Gavin McDonald, and Norman Maurer. Set up odyne at SARA, which will be used as a zone host. Made progress with Jason van Zyl regarding the adoption of maven.org and the central repo machine by the ASF.
WHEREAS, the Board of Directors heretofore has charged the President with the responsibility of overseeing the activities of the ad-hoc Infrastructure Committee using the President's existing authority to enter into contracts and expend foundation funds for infrastructure, and WHEREAS, the Board of Directors recognizes Paul Querna as the appropriate individual to chair the infrastructure committee, with respect to executing the board approved infrastructure plan binding the Foundation to infrastructure contracts and associated financial obligations, NOW, THEREFORE, BE IT RESOLVED, that Paul Querna be and hereby is appointed to the office of Vice President, Apache Infrastructure, to serve in accordance with and subject to the direction of the President and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. Special Order 7H, Appointment of Infrastructure Committee Chair, was approved by Unanimous Vote of the directors present.
Addressed the board's suggestion for more descriptive service names by updating our Nagios config and improving the documentation in dev/machines.html. Work continues on setting up bia (backups), driven by Norman Maurer, Tony Stevenson, and Gavin McDonald. Gaea (zones) was determined to have BIOS issues; as mentioned in the previous report, it was shutting itself down without warning. A ticket was filed with Sun, and remains open. The problem hasn't recurred since upgrading the BIOS, but we are carefully monitoring the machine. Henk Penning continues to keep a close eye on the operation of the rsync mirrors and their signed contents. We purchased a NIC and 8GB of RAM; the NIC is insurance against a failing NIC in eris (svn), and the 8GB of RAM was divided between gaea and hyperion (zones). Several zones were created, the Tuscany TLP was migrated, and roughly 30 new accounts were created. We are working with OSUOSL to get the new hermes (mail) online and a new Sun 5220 racked. Eos and aurora (websites) had system upgrades performed by Mads Toftum and Norman Maurer. Norman Maurer was granted root karma on people.apache.org. Gavin McDonald was granted apmail karma. The general availability of svn.eu.apache.org was announced on committers@. Upayavira volunteered to represent the infra team at an OSS Watch workshop on profiling open source communities.
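As an illustration of what "more descriptive service names" means in Nagios terms: each monitored service carries a free-form service_description, so the fix is largely editorial. The definition below is a hypothetical example, not our actual configuration; the host name, description, and check are assumptions.

```
# Hypothetical Nagios service definition -- illustrative only
define service {
    use                 generic-service
    host_name           eris
    service_description SVN master repository (svn.apache.org)
    check_command       check_http
}
```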
We purchased 3 Dell 2950s costing about $12K total. We purchased a certificate for svn.eu.apache.org, intended for use as an svn mirror of our main repos. We are in the final testing stages now, and expect to bring this machine into community service in the very near future. Sun donated 8 SAS drives, and we had them shipped to Sander Striker. The drives will eventually be installed in odyne, the 4150 we cannibalized to get harmonia up. Sun has offered to provide us with a support contract for our machines at SARA, at no charge to the ASF; details have yet to be finalized. Old eris has retired itself due to its failing RAID array, and one of the 2950s was pressed into service via trial by fire. A tremendous amount of effort went into bringing the replacement box online and stabilizing it, chiefly by Philip Gollucci, Norman Maurer, Paul Querna, and Tony Stevenson. Work continues on setting up bia, our backup host, driven by Norman Maurer and Tony Stevenson. Roy T. Fielding made significant improvements to our ezmlm installation, eliminating the need for moderation on our commit lists. Sebastian Bazley has done an incredible amount of work rationalizing various foundation records. Roughly three dozen new accounts were created this month. Two TLPs, Archiva and CXF, have been migrated. In light of the recent Debian/Ubuntu security advisory (CVE-2008-0166) regarding ssh/ssl, we have upgraded the host keys on all of our Ubuntu hosts, and have scanned all the public keys on people.apache.org. We investigated the SSL certificate on brutus and found that it predates the vulnerability. Four committers' and two members' accounts had their public ssh keys disabled on people.apache.org for failing to comply with a request from root@ to remove them within 48 hours of being notified. The keys in question were all detected by the dowkd.pl script, and most users (~30 total) who received the notice dutifully complied. Gaea has started shutting itself down occasionally, for reasons unknown.
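Background on the dowkd.pl check mentioned above: the vulnerable Debian OpenSSL could only ever generate a small, enumerable set of keys, so detecting a weak key reduces to looking its fingerprint up in a precomputed blocklist. The toy Python sketch below shows the idea only; the fingerprints in it are invented placeholders, not real weak-key data, and real tooling uses full key blobs and complete fingerprint databases.

```python
import hashlib

# Invented placeholder entries -- a real blocklist holds fingerprints of
# every key the broken PRNG could produce, per key type and size.
WEAK_FINGERPRINTS = {
    "d4:1d:8c:d9:8f:00:b2:04",
    "9e:10:7d:9d:37:2b:b6:82",
}

def fingerprint(pubkey_blob: bytes) -> str:
    """Toy fingerprint: first 8 bytes of the MD5 digest, colon-separated."""
    digest = hashlib.md5(pubkey_blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, 16, 2))

def is_weak(pubkey_blob: bytes) -> bool:
    """A key is 'weak' iff its fingerprint appears in the blocklist."""
    return fingerprint(pubkey_blob) in WEAK_FINGERPRINTS
```

The design point is that the scan needs no cryptanalysis at all: membership in a known-bad set is sufficient, which is why a single pass over authorized_keys files could flag every affected account.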
We are investigating, but so far there's been little information to go on in the logfiles.
It was noted that a map of host names to services can be found at
Concerns about host names are deferred to the infrastructure team.
Purchased a one-year silver-level Sun support contract from Technologent covering the OSUOSL-located Sun equipment. We had the system board and a DIMM in eos replaced under the aforementioned support contract. Work continues on setting up bia, our backup host, mainly driven by Tony Stevenson and Norman Maurer; both of them have been granted apmail karma. Roughly two dozen new account requests were processed. Sun graciously donated a pair of x4150s for use at SARA. We have brought one of them online as harmonia, and will be pressing it into service as an svn mirror site. A Confluence-backed website was hacked into due to an improper permissions scheme. As Confluence typically does not provide change notifications, no record of the event was sent to the affected project's mailing lists. It was later determined by David Blevins that roughly 30 other Confluence spaces were also misconfigured, further reinforcing the opinion of many infrastructure members that the Confluence installation at the ASF is a bridge to nowhere. Wendy Smoak continues her excellent work reviewing changes to the Apache Maven repos on repository@. Gavin McDonald, Norman Maurer, and Tony Stevenson were all granted full infrastructure karma.
Pressed the new brutus machine into production and shut down old brutus. Brought bia (the new backup machine) online; work continues on setting up the actual backups, mainly driven by Tony Stevenson and Norman Maurer. Experiencing some problems with JIRA's availability on new brutus; Jeff Turner is looking into it. Migrated Bugzilla to 3.0, with special thanks to Mark Thomas and Sander Temme for all the hard work that went into that process. TLP migration for Continuum was completed. Roughly two dozen new account requests were processed. infrastructure-dev@ was made a public list at Jukka Zitting's request.
Acquisition strategy through May 2009; bottom-line figure is $58,900.

Key features:
- Replace our aging IBM x345s, currently the cornerstone of our infrastructure; they are nearing 4 years old
- Stagger replacements so as not to do it all in one go
- Gets an SVN mirror in the EU
- Equip a respectable build farm
- Equip for the CMS 'thing-ma-bob' with staging + 2 prod servers

The specifics of the machines may change, but this is the overall plan.

Acquisition strategy:
- OSL: Stay relatively power-neutral for the next 4-6 months; expand after that
- SARA: Expand to 20U 'early next year' (pushing for Feb.)
- x345s: Acquired in late 2003 / hermes prod in Feb. 2004
- Helios: In service approx. April 2005

Decision: Base configuration: stick with the x2200 M2 with SATA drives

---

Base equipment costs [as of 10/28/2007]:

Machine:
- x2200 M2 - 1x 2210 / 2GB / no drives: $1619/ea (incl. tax+shipping)
- x2200 M2 - 2x 2218 / 8GB / no drives: $3871/ea (incl. tax+shipping)
- x4150 - 1x Intel E5320 (1.86GHz; quad) / 2GB / no HDD: $3082/ea (incl. t&s)

Machine extras:
- CPU: AMD Opteron 2210 (1.8GHz): $179/ea from Newegg
- CPU: AMD Opteron 2218 (2.6GHz): $455/ea from Newegg
- RAM: DDR2 PC2-5300 / CL=5 / Registered / ECC / DDR2-667 / 1.8V, 2GB sticks from Crucial: $135/ea [buy in pairs]
  - 8GB (4x2GB) -> $600 (incl. tax) [$581.81]
  - 16GB (8x2GB) -> $1200
  - 32GB (16x2GB) -> $2400

Storage:
- Hitachi A7K1000 750GB SATA drive: $230/ea from Newegg
- Seagate ES.2 1TB SATA drive: $339.99/ea from Newegg
- Factor $250 for 750GB; $350 for 1TB

Derived cost for manual upgrade of base x2200 config:
- 2x 2210 / 10GB / 2x 750GB -> $2919/ea ($3000/ea)
- 2x 2210 / 10GB / 2x 1TB -> $3119/ea ($3200/ea)

2nd-level x2200 config:
- 2x 2218 / 8GB / 2x 750GB -> $4371/ea ($4400/ea)

3rd-level x2200 config:
- 2x 2218 / 8GB / 2x 1TB -> $4571/ea ($4600/ea)
- 2x 2218 / 16GB / 2x 1TB -> $5171/ea ($5200/ea)
- 2x 2218 / 32GB / 2x 1TB -> $6371/ea ($6400/ea)

x4150 config:
- 1x E5320 / 4GB / 6x 750GB -> $4982/ea ($5000/ea)

---

Helios [Solaris zones]:
- 13x 750GB SATA drives = $3250
- Battery backup replacement (Jan-Feb): $450
  http://www.memoryxsun.com/3705545bat.html [370-5545-BAT]
- Conditional: needs correct braces from Sun; ETA next week @ OSL
- Purchase: December; in service: January
- Helios array total: $3700

---

Brutus [issues: JIRA/Bugzilla/Confluence]:
- x2200 M2: 2x 2218s (4 cores) / 8GB RAM / 2x 1TB SATA / Linux x86_64
  [Atlassian will support via the official Sun x86_64 JVM]
- Purchase: December; in service: January @ OSL
- Expected price: $4600

---

SVN mirror @ SARA:
- x4150: 1 quad-core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
- Conditional on the x4150 being available with SATA drives
- Purchase: late Dec / early January; in service: February @ SARA
- Expected price: $5000
  [SARA box needs to be purchased through Sun NL, so may be more if in EUR.]

---

Eris replacement [SVN @ OSL]:
- x4150: 1 quad-core Xeon / 4GB RAM / 6x 750GB SATA / FreeBSD
- Purchase: March (after SVN mirror @ SARA setup); in service: April
- Expected price: $5000

Loki [cold spare @ OSL]:
- x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
- Purchase: May; in service: June
- Expected price: $3000

Hermes [mail @ OSL]:
- x2200 M2: 2x 2210s (2 cores) / 10GB RAM / 2x 750GB SATA / FreeBSD
- Purchase: July; in service: August-September
- Expected price: $3000

Build farm (@ OSL) - stage 1:
- (2) x2200 M2: 2x 2210s (4 cores) / 16GB RAM / 2x 1TB SATA / Linux (VMware Wks)
- Purchase: September; in service: October-November
- Expected price: $11,400 (2 @ $5200)

Build farm (@ OSL) - stage 2:
- Apple Xserve: 2x dual-core Intel Xeon / 1GB RAM / 80GB SATA
- Purchase: January '09; in service: February '09
- Expected price: $3200 ($2,999 + tax & shipping)

CMS thing-ma-bobs:
- Staging @ OSL: x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
- Prod @ OSL: x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
- Prod @ SARA: x2200 M2: 2x 2210s (4 cores) / 32GB RAM / 2x 1TB SATA / Linux
- Purchase: February '09; in service: March-May '09
- Expected price: $20,000 (3 x $6400)
  [SARA box needs to be purchased through Sun NL, so may be more if in EUR.]
Approved by General Consent
Tabled due to time constraints.
Sander provided an overview of the current state of the ASF Infrastructure, as summarized in the President's report above.
Approved by General Consent.
No report received or submitted. Sander to be contacted regarding status.
Sander submitted an oral report, noting that Infrastructure was very busy. Leo Simons sent an email to committers@ asking for volunteers to help.
Approved by General Consent.
During ApacheCon, Martin Kraemer and David Reid made significant progress on the ASF Certificate Authority. There are high hopes to have the full system (including authentication for version control, mailing list archives, etc.) before the next ApacheCon. A call for volunteers for the search committee for an external sysadmin was issued. The search committee is to be responsible for:
- finding/soliciting candidates, by whatever means;
- selecting candidates;
- reporting progress of the search to the Board;
- and finally, recommending candidates to the Board.
The search will go on until the search committee is satisfied it has seen enough viable candidates. Responses to the call for volunteers have been sparse, but there are four volunteers to serve on the committee:
- Erik Abele
- David Reid
- Scott Sanders
- Sander Temme
Justin Erenkrantz has stepped back from Infrastructure, due to time constraints. Given the amount of energy he has put into Infrastructure over time, I at least owe him a thank you here. Before ApacheCon EU, during a mini-infrathon, and during ApacheCon EU itself, the Infrastructure team addressed the situation which caused the version control tool to wedge. It is therefore believed that the final argument against migrating to Subversion has been overcome. For the rest of the news from Infrastructure, I am going to refer to a mail sent out by Leo Simons to all committers. Message-ID: <BF01611C.4F8C0firstname.lastname@example.org>
See Attachment L
April-June

The infrastructure team has been so busy it hurts. We have migrated a few more projects to top level, migrated a few from CVS to SVN, and added some new infrastructure projects and users -- "the usual". We are seeing roughly 600 emails a month on the infrastructure mailing list, excluding svn commit messages. Besides the usual, some things of note:

Nontechnical
------------
* We have slowly gotten to work on the board's request to formulate RFPs on paid staff/outsourcing.
* OSU OSL has generously offered to provide us some hardware along with hosting at their colo.
* We have set up a new mailing list, email@example.com, which is tasked with figuring out a flexible and powerful new website publishing process.
* We have solicited volunteers, which has yielded roughly half a dozen responses from previously silent people, as well as somewhat inactive infrastructure participants offering to become more active.
* We have found we are not currently in optimal shape for productively growing our team and giving people things to do. Work is in progress to improve that.
* Quite a bit of work has been done, and is still in progress, on internal and project-oriented documentation.
* We have some promising submissions to the Google Summer of Code programme for helping with the asf-workflow tool.
* We are planning another infra-thon (infrastructure team get-together) in the weekend leading up to ApacheCon, which should be considerably more modest, and hence cheaper, than the last one.

Technical
---------
* Our mailserver has had a lot of trouble handling the load (mostly spam) lately. Solutions being explored include optimizing the machine's configuration, patching the mail software to be more efficient, taking a new machine at OSU OSL into operation, and generally anything else we can think of. It has taken a lot of time just to keep things running, and we have seen some service interruptions. We may need more hardware if the amount of spam that goes around the web continues to grow.
* Brutus has been partially reinstalled to serve as another FreeBSD host, which hopefully will be finished this week. We will use it to take over some of the duties of minotaur (our main box) as we upgrade that machine.
* AMD has donated a machine for running apachecon.com, which we will be hosting in our rack in the San Francisco colo. The machine is scheduled for install this week.
* Loki (running VMware) has been partially configured for Gump runs.
* We have another machine on loan at OSU OSL for Gump runs which is not in operation yet.
* Quite a few zones (a dozen or so) have been set up on the new Sun box, helios. Various PMCs have been busy setting up a variety of services. Helios is still in a testing phase.
* We are hoping to bring the RAID array that came with helios into operation this week.
* We experienced a lot of performance problems with the wiki, which were solved by putting in place a development install of Apache httpd with mod_cache.
* JIRA has been upgraded to the most recent release. It is still being tuned to be as stable as it was before. Users are seeing a notable performance increase and fancy new features, such as automated links to Subversion changes.
* Our SVN service had a few hiccups, but the incident rate seems to be decreasing even as usage continues to increase. The majority of our projects are now using SVN, with more projects in the migration queue.
* The certificate service (ca.apache.org) has seen a lot of development work recently.
* We have moved DNS registrars; we are now with Dotster.
* Serge's Nagios install has been moved to monitoring.apache.org and reconfigured to provide even more useful information.
* We have purchased and installed another PDU at our main colo in SF.
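The mod_cache fix for the wiki works by serving cached copies of pages instead of hitting the wiki backend on every request. A rough httpd 2.x-era disk-cache fragment is sketched below; the paths, cache prefix, and expiry are assumptions for illustration, not the actual ASF configuration.

```
# Illustrative mod_cache/mod_disk_cache setup for fronting a wiki
LoadModule cache_module      modules/mod_cache.so
LoadModule disk_cache_module modules/mod_disk_cache.so

<IfModule mod_disk_cache.c>
    CacheRoot          /var/cache/httpd/wiki   # on-disk cache store
    CacheEnable disk   /wiki                   # cache this URL prefix
    CacheDirLevels     2
    CacheDirLength     1
    CacheDefaultExpire 3600                    # fallback TTL in seconds
</IfModule>
```

The design choice here is to absorb read traffic in front of a slow dynamic application rather than tune the application itself, which matches the "development install of httpd with mod_cache" approach described above.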
The Infrastructure Team gathered for an infra-thon from March 18th (Fri) until March 22nd (Tue). A travel/lodging budget of roughly $3500 was spent; the final number is not available at this time. All machines based at Mission St. have been moved to Paul, the new location of the UnitedLayer colo. Three passes have been requested for physical entry to the colo, for Brian Behlendorf, Scott Sanders, and Sander Temme. Some issues arose during the infrathon, but none have caused severe downtime or unacceptable discomfort to our projects. Due to an error on the part of the team, we have now roughly figured out how much bandwidth we would use without the mirrors: roughly 70Mbps. The power distributors bought as approved by the board have not been used. It turned out that we can actually use the 0U PDU that we already have, and there is room for a second one. We'll have to return the unused PDUs; Mirapath is working on a quote for a new 0U PDU. We are planning on moving all shell accounts elsewhere at some point; these will be hosted under people.apache.org. DNS has already been updated, meaning our committers have to use people.apache.org to log into their shell accounts. Also, we managed to get hermes (our primary mailserver) failing every 7-8 hours. This problem seems to have disappeared after a BIOS update and/or a kernel update. Brutus, the current Gump machine, can remain under supervision of the Gump PMC until we are done setting up the new Sun server, helios, and loki, our machine purposed for running VMware. After that, brutus will most likely become our secondary mailserver. The plan was that it would be shipped to The Netherlands, where it can do its work and serve as a fail-over in case of failure of hermes. However, due to external conditions that are applicable to the IBM machines, we cannot ship any of them out of the US at this point.
Minotaur, the machine usually serving our websites, Subversion, and shell accounts, was relieved of serving the websites prior to the colo move. We've left this in place because we wish to upgrade the OS on that machine in the near future. We are synchronizing content to ajax every 4 hours. We were planning to upgrade minotaur to FreeBSD 5.3; unfortunately, there were some setbacks that prevented us from doing a backup. We weren't feeling very lucky, and have decided to put off the upgrade until a later point in time. Ajax, our European-hosted machine, is currently hosting most of the websites (with the exception of tcl and perl), as well as the wiki, JIRA, and Bugzilla. We've switched off indexing by search engines for the wiki and the issue trackers, since the load on the machine was insanely high. Ajax too is scheduled for an OS upgrade, due to the fact that it is stuck in I/O wait half the time. We will, however, not be doing this while most of our infrastructure is hosted by that machine. It has held up pretty nicely, even when we were at some point not using the mirrors and it was handling most of our traffic (we peaked at 50Mbps). Loki is going to be our machine hosting VMware. We've added 2GB of memory and two 36GB disks from one of the other machines, giving it enough beef to actually run the ESX installation. Its primary purpose is going to be to host various OS instances for Gump to run on. Helios is our new Sun v40z. It came with a 1TB StorEdge RAID array. Unfortunately, it was delivered without an OS installed and no install media, and the FibreChannel card to connect the StorEdge to the machine was missing. This will all be resolved, but for now we are not able to swiftly put this box into production. However, when put into production this machine will host several different so-called zones. One of these zones will be people.apache.org, which will host all current shell accounts.
Every PMC will most likely also get its own zone, which it can use as a testing ground/showcase for its own software. Finally, Gump will be given one or two zones for its runs. Eris, our final machine, is stripped down to just the chassis, pending hardware to refit it. The wiki has seen an upgrade, which was quite an undertaking. Preparation was done for the Bugzilla upgrade; the final upgrade will be done at a later stage. Work is progressing on the Eyebrowse to mod_mbox migration, preserving the Eyebrowse URLs. Subversion is doing fairly well. We've seen a number of repository wedges, which seem to have nothing to do with the core functionality of Subversion, but rather with the add-on functionality provided by ViewCVS. Since we have not been able to pinpoint the cause, we are aiming to cure the symptoms by moving our Subversion backend from fs_bdb to fsfs. All in all, the infrathon has proven to be a valuable experiment. For the future, I personally would consider limiting infrathons to software work only, deferring all work involving hardware to the locals. The high-bandwidth communication, as well as the near full-time availability of the entire team, has proven to be incredibly useful. That said, the focus for our services will have to shift at some point to integration and coupling, as currently we are growing more and more islands that consequently require a relatively large support team. Reducing complexity for our users, as well as for the Infrastructure team, is definitely something to consider.
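For reference, a BDB-to-FSFS backend move of the kind described above is typically a dump/load cycle. The commands below are a sketch with placeholder paths; a real migration would also freeze commits for the duration and carry over hooks and access configuration.

```shell
# Serialize the full revision history out of the BDB-backed repository
svnadmin dump /x1/svn/repo-bdb > repo.dump

# Create a fresh repository using the fsfs backend
svnadmin create --fs-type fsfs /x1/svn/repo-fsfs

# Replay the history into the new repository
svnadmin load /x1/svn/repo-fsfs < repo.dump
```

The appeal of fsfs in this situation is that it cannot "wedge" the way BDB environments can after an interrupted process, which is exactly the symptom being cured.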
Infrastructure Team report approved as submitted by general consent.
We're transitioning from nagoya to ajax. We've submitted the final H/W list to Sam for IBM, and are waiting to hear back from him. We're going to proceed with the purchase of a UPS for our colo rack shortly. We are also coordinating an infrastructure get-together in SF for Q1 2005, so that we can address the pressing large-scale items on our plate with as many people as possible in the same room, near our servers, to coordinate server hardware and software upgrades. Financial assistance to help bring the participants together is desired.
Apache Infrastructure report approved as submitted by general consent.
6. Special Orders
7. Discussion Items
Infrastructure benefits a lot from the face-to-face meetings at ApacheCon. Work is getting done. David Reid and Ben Laurie have been working on the CA, and things are looking very promising. We will be able to offload all the adding and removing of people from groups. This will require that all services are exposed via HTTP, so there is a lot of work that needs to be done. With the moving of several projects to SVN, this goal will be easier to reach, given that shell accounts won't be needed anymore to do actual development. People do actually volunteer, but it is hard to parallelize a lot of the tasks, given the centralized knowledge. Services are being moved around and off nagoya, since the machine is being retired. Infrastructure is budgeting $5k for the acquisition of a UPS for the US colo.
Sander reported that Infrastructure is finding itself steadily overworked, and that there is confusion over how much authority Infrastructure has; he pointed to the mirroring policy as a prime example. He was happy to report that additional volunteers have joined, especially Roy.
The Infrastructure team is battling the same problems as reported before. Infrastructure needs help to get things done. The root rotation doesn't get filled. An obvious way out is hiring a (part-time) sysadmin, so that the team can focus on developing automation tools that make the job less involved. That said, Ken Coar offered help in writing mailing list management tools, and Geir also offered to help out with the mailing lists. And we are happy to announce that Berin Lautenbach has joined the group of roots. Hardware-wise, we gained a switch contributed by Theo van Dinter; we are waiting for that to arrive. Ajax still doesn't have console access, nor do we have the accounts on the power switches to power-cycle the box. SURFnet also asked us to work out the reverse DNS on our end, which we haven't gotten around to yet. DNS is under the Infrastructure team's control again, and we are happy to announce that we added two more secondaries, making our DNS servers a bit more globally spread. A lot of work has been done getting VMware instances up and running, and this seems to pay off. Investigation into applicability is ongoing. Services hosted on nagoya will be moved off to other boxes, given the current (in)stability of some of the services on nagoya. Eris, one of the new IBM x345s, seems to be having some problems. Investigation is ongoing.
One issue: one of the new IBM boxes is not functioning correctly. Budget is needed for a switch and a UPS, but we are still working through the details, so there are no current action items.
Sander gave a verbal infrastructure report. Recent events have included minotaur disk problems; new disks have been sourced. The new machines are now in the racks and are currently being tested. There was some discussion with United Layer over power consumption, but this has been resolved without requiring any action.
[ from Brian Behlendorf ] Major efforts: New ASF server was paid for by ASF check and picked up by Brian. Brian to set up an in-person meeting for the local ops team to check out the box, learn how the RAID works, etc. To be scheduled for some time next week most likely. The colocation agreement with United Layer was signed. I'll send a copy to Jim for our records. We can move in any time, billing starts a few weeks after we move in our first box. First move might happen this week if I get my act together (Brian speaking). Other work done by the team: Created 22 accounts: jesse, antoine, andreas, alexmcl, egli, gregor, michi, edith, felix, liyanage, memo, thorsten, minchau, sterling, psmith, sdeboy, brudav, michal, cchew, yoavs, funkman, joerg Removed 1 account: mehran Updated the account creation template in the infrastructure repository, actfrmtmpl.txt Dealt with a couple out-of-disk-space situations Dealt with some runaway robots Discussed whether to change the commit-mailer to only send viewcvs URLs if the commit message would otherwise be too big. Updated the Subversion installation on icarus Fixed a content-encoding issue on the web server, and turned off SSI Updated search.apache.org Also of note: A demo of SourceCast for apache.org has been set up, and I've shared access to it for a few folks. Most of them are busy, though, so if others would like to give it a try, write me privately. I'll be putting together a true plan for ASF evaluation over time, this is just an initial kick-the-tires kind of thing.
The following resolution was proposed: WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish an ASF Board Committee charged with maintaining the general computing infrastructure of the ASF. NOW, THEREFORE, BE IT RESOLVED, that an ASF Board Committee, to be known as the "Apache Infrastructure Team", be and hereby is established pursuant to the Bylaws of the Foundation; and be it further RESOLVED, that the Apache Infrastructure Team be and hereby is responsible for creating and upholding the computing policy for the Foundation; and be it further RESOLVED, that the Apache Infrastructure Team is charged with managing and maintaining the infrastructure resources of the Foundation; and be it further RESOLVED, that the Apache Infrastructure Team is charged with accepting infrastructure resource donations to the Foundation; and be it further RESOLVED, that the Apache Infrastructure Team is responsible for handling communication and coordination in relation to infrastructural issues; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Infrastructure Team: Brian Behlendorf (chair), Justin Erenkrantz, Pier Paolo Fumagalli, Ask Bjoern Hansen, Aram Mirzadeh, Steven Noels, David Reid, Sander Striker. Discussion on this resolution focused on the need for such a Board Committee. Roy Fielding noted that such a committee might be best handled as a President's Committee, since the President, rather than the Board, is in charge of operational aspects of the ASF. It was further discussed that such a team would be a good idea to create a focal point for long-term initiatives, to act as a contact point, and to create a sense of empowerment for the people interested in the technical infrastructure of the ASF.
By general consent, this resolution was tabled, with a recommendation to the President to establish a President's Committee with the same goals and responsibilities.