[Group] Google Technical Infrastructure Resource Management

Functionality

  • Responsible for the efficient utilization of all Google data centers through software, services, and business operations
  • Develops management software for all Google products (e.g., Google Cloud, Gmail, YouTube, Maps, Android, Search, Ads, etc.) to manage their machine, compute, and storage needs

Focus

  • Cloud resource distribution
  • Efficient management
  • Product catalog
  • Pricing automation
  • Demand planning
  • Fleet planning
  • Ordering
  • Cost allocation
  • Usage collection
  • Financial policy control
  • Data warehouse
  • Reporting

Questions

    • How can I contribute as an intern?

    Questions about Borg

    • Scalability
      • How are nodes added and removed?
    • How Borg handles failure
      • Page 2 says Borg restarts tasks if they fail. Could this cause errors, e.g., some operations being performed more than once? Should the user’s program handle such failures itself, e.g., through rollback?
      • Security
        • Could malicious jobs try to interrupt other jobs or breach the system?
        • Borg monitors the health of the jobs.
          • But health seems to be judged by response time, etc.
      • Migration
        • Task scheduling is critical because the physical machines are not virtualized
          • So is job migration not possible?
      • Over-selling
        • As users may over-purchase resources, Borg uses an over-selling approach
          • What if the over-sold demand cannot be satisfied? E.g., United Airlines over-sells tickets; will this cause similar trouble?
          • Or is over-selling applied only to low-priority jobs?
          • How can the future resource usage of the other Google groups be predicted, so that they can purchase quota as accurately as possible?
        • Do users pay for what they use, or for what they reserve?
      • The idea of a Borgmaster simulator is cool
        • It can be used to decide whether a configuration change will evict any important jobs
      • Fake estimation in order to be scheduled earlier
        • Purposely understating CPU needs in order to be scheduled earlier
      • Will the user specify the running time of the job?
        • How would a false estimate affect the schedule?
      • How is Borg different from Google Compute Engine (GCE)?
        • Can Borg be replaced by GCE?
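
A minimal sketch for the over-selling questions above, showing one way priority could decide who loses out when an over-sold machine actually runs short. Everything here is hypothetical (the `Task` fields, the `evict_until_fit` helper, the task names and numbers); it is not Borg's real API, only an illustration of the idea that lower-priority work absorbs the shortfall.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int     # higher number = more important (assumed convention)
    cpu_used: float   # cores actually in use right now


def evict_until_fit(tasks, capacity):
    """Return (kept, evicted) so that the kept tasks fit within capacity.

    Tasks are evicted in order of increasing priority, so the
    lowest-priority task is always the first to go.
    """
    kept = sorted(tasks, key=lambda t: t.priority, reverse=True)
    evicted = []
    while kept and sum(t.cpu_used for t in kept) > capacity:
        evicted.append(kept.pop())  # last element has the lowest priority
    return kept, evicted


# Hypothetical machine with 10 cores that was over-sold to 11 cores of usage.
tasks = [
    Task("web-frontend", priority=200, cpu_used=6.0),
    Task("batch-index", priority=100, cpu_used=3.0),
    Task("log-compaction", priority=50, cpu_used=2.0),
]
kept, evicted = evict_until_fit(tasks, capacity=10.0)
print("kept:", [t.name for t in kept])
print("evicted:", [t.name for t in evicted])
```

In this toy run only the lowest-priority task is evicted, which matches the guess in the notes that over-selling mainly puts low-priority jobs at risk.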

      Questions Answered by Myself

      • Q: As all Google groups pay for the resources they use, is the price a fixed value, or does it include incentives that motivate people to make better use of the resources?
        • For example, a more urgent request for machines is more expensive. This can motivate people to plan their machine usage.
        • A: Yes. Jobs have different priorities, and higher-priority jobs are more expensive.
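
The answer above can be illustrated with a small sketch of priority-tiered pricing. The tier names and the per-core-hour rates are made up for illustration; the notes only establish that higher-priority jobs cost more.

```python
# Hypothetical price per CPU-core-hour for each priority tier.
RATE_PER_CORE_HOUR = {
    "best_effort": 0.5,
    "batch": 1.0,
    "production": 2.0,
}


def job_cost(tier, cores, hours):
    """Cost of holding `cores` cores for `hours` hours at a priority tier."""
    if cores < 0 or hours < 0:
        raise ValueError("cores and hours must be non-negative")
    return RATE_PER_CORE_HOUR[tier] * cores * hours


# The same 4-core, 10-hour job costs twice as much at production priority
# as at batch priority under these assumed rates.
print(job_cost("batch", cores=4, hours=10))       # 40.0
print(job_cost("production", cores=4, hours=10))  # 80.0
```

Under a scheme like this, a team that plans ahead and can tolerate a lower tier pays less for the same amount of compute, which is the incentive the question asks about.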