Originally Posted by ib-dick
My name's Dick, and I'm the project manager for TrekEarth. I'm not new here. In fact, I've been working on TrekEarth since March of 2009, before the redesign, when Adam was handling all of the tech for the site. I worked with Adam on the relaunch, the migration to the new forum, and just about everything since.
Before I address the recent tech problems, I'd like to address my absence on the forums. Back when Adam was around, my presence wasn't really needed, as the community knew Adam and he did a great job communicating with you. When Adam left, Steph did an amazing job of stepping in and engaging the community. We decided that our time was better spent with me managing the tech side of the site, and her communicating the issues to me.
While Amber continues to do an amazing job of being a conduit between me and the community, I believe the recent tech issues warrant a response directly from the tech team, hence this post.
When the site was migrated to our hosting (a long time ago), it was set up using an architecture mirroring the old host. While this architecture isn't terrible, it has some limitations and is not our standard configuration. We have had a plan to move this to a more modern configuration for some time, but have felt our efforts were better spent fixing bugs. Over the last year, we've made preparations, but stayed short of an actual migration to new hardware and updated software.
Because of this outdated and unfamiliar environment, we've had considerable difficulty working with the code and have been limited in our ability to mitigate scalability problems. Many of the bugs that have appeared over the last year have been due to these issues. It seems like every time we fix something, two other things break, and those two things are, once again, difficult to track down because of the current technology.
One specific problem that we were having was with the database. It ran an outdated version of MySQL that our developers are less familiar with. Many of our performance enhancements don't work on that version, so when the site became slow, we weren't able to address the issues as quickly as we'd have liked.
As to the most recent database errors, I'm sorry for the problems. There was a poorly written query that had been running on the site for a long, long time. It had always been a problem, but last weekend it came to a head: it finally hit the breaking point and started crashing the site. With the outdated version of MySQL, we couldn't use our standard diagnostic tools to track it down. Since the site was constantly crashing, I made the decision to migrate the database to a new server with updated software.
While this is usually not a monumental task, this specific migration posed significant challenges. First, since the site was crashing, we had to make the migration on short notice. This meant that we didn't have the time to do our usual preparations and planning. Furthermore, we didn't have time to do a test upgrade to look for errors. We were forced to do all of this on the fly, and, as with any time you operate in this manner, problems came up. Some of these problems are still being actively addressed by me and my developers.
One of the oversights was the beta side of the site. When we switched the www side of the site over to the new database, we didn't realize that the database server's IP was specified in a different manner on the beta side of the site. This meant that, for a period, the beta side of the site was still using the old database. When this was corrected, all of the content that had been added to the site through the beta interface during that period was lost. I apologize for this. Furthermore, this afternoon we realized that the beta version of the forum was also still using the old database. Again, when we corrected this, we lost all content that had been added through the beta side of the forum.
It is because of this added complexity that we never intended to maintain the beta version of the site. Still, it was absolutely not our intention for these events to happen. We did not purposefully neglect the beta side of the site. We were simply reacting the best we could to a bad situation.
The good news is that, with the exception of a few lingering issues, we are now using a more modern version of the database software. This has not only allowed us to fix the problematic query, it has also allowed us to fix other problematic queries and perform some much-needed optimizations. From the database side of things, we're in a much more stable condition.
From the web server side of things, we're still in a less-than-optimal situation. With the current configuration, we don't have the ability to scale horizontally, which is what we need to do. The recent problems have demonstrated that we need to give this a higher priority, and this time we will have the time to plan it out properly.
Over the next week, we will be making a number of changes to the site's application servers. We will be adding more power and redundancy to the system. During this period, we might have some outages. I intend to give fair warning here on the forums when we expect these outages, and we will do our best to minimize them. We understand that it's been a rough ride lately, and we don't want to make it any worse. We are working to make things better.
I would like to apologize for the inconvenience that the recent tech issues have caused, and I appreciate your patience while we continue to work through them.
Project Manager -- Internet Brands