gpcdesre/Part_4.md
2021-02-03 16:54:24 +00:00

Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)

Introduction

About the Course and Learning Path

Make better software, faster

Milestone: Getting Started

Understanding Operations in Context

Section Introduction

What Is Ops?

  • GCP Defined

    • "Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"

    • Logging Management

      • Gather logs, metrics, and traces everywhere
        • Audit, Platform, User logs
          • Export, Discard, Ingest
    • Error Reporting

      • So much data: how do you pick out the important indicators?
        • A centralized error management interface that shows current & past errors
        • Identify your app's top or new errors at a glance, in a dedicated dashboard
        • Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, and a cleaned exception stack trace
    • Across-the-board Monitoring

      • Dashboards for built-in and customizable visualisations
        • Monitoring Features
          • Visual Dashboards
      • Health Monitoring
        • Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
      • Service Monitoring
        • Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
    • SRE Tracking

      • Monitoring is critical for SRE
        • Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
          • Pinpoint SLIs and develop an SLO on top of them
    • Operational Management

      • Debugging
        • Inspects the state of your application at any code location in production without stopping or slowing down requests
      • Latency Management
        • Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
      • Performance Management
        • Offers continuous profiling of resource consumption in your production applications along with cost management
      • Security Management
        • With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

What is Ops: Key Takeaways

  • Ops Defined: Watch, learn and fix
  • Primary services: Monitoring and Logging
  • Monitoring dashboards for all metrics, including health and services (SLOs)
  • Logs can be exported, discarded, or ingested
  • SRE depends on ops
  • Error alerting pinpoints problems, quickly

Scratch:

  • Metric query and tracing analysis
  • Establish performance and reliability indicators
  • Trigger alerts and error reporting
  • Logging Features
  • Error Reporting
  • SRE Tracking (SLI/SLO)
  • Performance Management

Clarifying the Stackdriver/Operations Connection

  • 2012 - Stackdriver Created
  • 2014 - Stackdriver Acquired by Google
  • 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
  • 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into GCP and drops the name

Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs (formerly all called Stackdriver)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

  • Complete UI Integrations
    • All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
  • Dashboard API
    • New API added to allow creation and sharing of dashboards across projects
  • Log Retention Increased
    • Logs can now be retained for up to 10 years and you have control over the time specified
  • Metrics Enhancement
    • In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
  • Advanced Alert Routing
    • Alerts can now be routed to independent systems that support Cloud Pub/Sub
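
Routing alerts through Cloud Pub/Sub means the receiving system gets the alert as a Pub/Sub message whose `data` field is base64-encoded. The following is a minimal sketch of decoding a push-style envelope and choosing a destination. The field names inside `incident`, the destination names, and the project and subscription names are assumptions made for illustration, not a definitive description of the notification format.

```python
import base64
import json


def decode_pubsub_push(envelope):
    """Decode the base64 'data' field of a Pub/Sub push envelope into a dict."""
    raw = base64.b64decode(envelope["message"]["data"])
    return json.loads(raw)


def route_alert(notification):
    """Pick a hypothetical destination based on the (assumed) incident state."""
    incident = notification.get("incident", {})
    state = incident.get("state", "unknown")
    if state == "open":
        return "pager"
    if state == "closed":
        return "ticket-log"
    return "triage-queue"


# Build a sample envelope locally; the keys inside "incident" are assumed.
sample_alert = {"incident": {"policy_name": "High latency", "state": "open"}}
envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(sample_alert).encode("utf-8")).decode("ascii"),
        "messageId": "1",
    },
    "subscription": "projects/example-project/subscriptions/alerts",
}

notification = decode_pubsub_push(envelope)
print(route_alert(notification))
```

Keeping the decode step separate from the routing decision makes it easy to swap in whatever independent system should receive the alert.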

Operations and SRE: How Do They Relate?

  • Lots of questions on the exam about SRE

What is SRE? - "SRE is what happens when a software engineer is tasked with what used to be called operations" (founder of Google's SRE team)

Pillars of DevOps

  • Accept failure as normal:

    • Try to anticipate, but...
    • Incidents bound to occur
    • Failures help team learn
  • No-fault postmortems & SLOs:

    • No two failures the same
    • Track incidents (SLIs)
    • Map to Objectives (SLOs)
  • Implement gradual change:

    • Small updates are better
    • Easier to review
    • Easier to rollback
  • Reduce costs of failures:

    • Limited "canary" rollouts
    • Impact fewest users
    • Automate where possible
  • Measure everything:

    • Critical gauge of success
    • CI/CD needs full monitoring
    • Synthetic, proactive monitoring
  • Measure toil and reliability:

    • Key to SLOs and SLAs
    • Reduce toil, up engineering
    • Monitor all over time

SLI: "A carefully defined quantitative measure of some aspect of the level of service that is provided"

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:

  • Request Latency: How long it takes to return a response to a request
  • Failure Rate: A fraction of all requests received (unsuccessful requests / all requests)
  • Batch Throughput: Proportion of time that the data processing rate is greater than a threshold
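
The three example SLIs above can be computed directly from raw request and batch records. This is a minimal sketch, assuming a hypothetical in-memory record format (`Request`, `latency_ms`, `ok`) and made-up throughput samples. None of these names come from Google Cloud.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    latency_ms: float  # time taken to return a response
    ok: bool           # True if the request succeeded


def mean_latency_ms(requests):
    """Request latency SLI: average time to return a response."""
    return mean(r.latency_ms for r in requests)


def failure_rate(requests):
    """Failure rate SLI: unsuccessful requests / all requests."""
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def batch_throughput_sli(rates, threshold):
    """Batch throughput SLI: share of samples whose processing rate exceeds the threshold."""
    return sum(1 for rate in rates if rate > threshold) / len(rates)


# Hypothetical sample data.
requests = [
    Request(120.0, True),
    Request(340.0, True),
    Request(95.0, False),
    Request(210.0, True),
]
rates = [900, 1200, 1500, 800, 1100]  # records/second, one sample per minute

print(mean_latency_ms(requests))
print(failure_rate(requests))
print(batch_throughput_sli(rates, threshold=1000))
```

Each function returns a single number for the sampled period. Evaluating it repeatedly over successive periods gives the "metric over time" that an SLI is meant to be.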

Commit to Memory - Google's 4x Golden Signals!

  • Latency
    • The time it takes for your service to fulfill a request
  • Errors
    • The rate at which your service fails
  • Traffic
    • How much demand is directed at your service
  • Saturation
    • A measure of how close to fully utilized the service's resources are

Mnemonic: LETS (Latency, Errors, Traffic, Saturation)


SLO: "Service level objectives (SLOs) specify a target level for the reliability of your service" - The Site Reliability Workbook

SLOs are tied to your SLIs

  • Measured by SLI
  • Can be a single target value or range of values
  • Single target: SLI <= SLO
  • or
  • Range: lower bound <= SLI <= upper bound
  • Common SLOs: 99.5%, 99.9%, 99.99% (four 9s)
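
The two SLO shapes above (a single target or a lower/upper bound) and the common availability targets can be made concrete with a little arithmetic. This is an illustrative sketch. The function names are hypothetical, and the 365-day year is an assumption.

```python
def meets_target(sli, target):
    """Single-target form: the SLI must not exceed the target (e.g. latency)."""
    return sli <= target


def within_bounds(sli, lower, upper):
    """Range form: lower bound <= SLI <= upper bound."""
    return lower <= sli <= upper


def allowed_downtime_minutes(slo_percent, days=365):
    """Minutes per period a service may be unavailable and still meet an availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.5, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes/year")
```

Under these assumptions, each additional 9 cuts the tolerated downtime by a factor of ten, which is why the jump from 99.9% to 99.99% is so much harder to operate than it looks on paper.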

SLI - Metrics over time that detail the health of a service

  • Example: Site homepage request latency < 300 ms over the last 5 minutes at the 95th percentile

SLO - Agreed-upon bounds on how often SLIs must be met

  • Example: The 95th-percentile homepage SLI will succeed 99.9% of the time over the next year
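
The SLI/SLO pairing above can be sketched in two steps: check the 95th-percentile latency in each 5-minute window, then measure how often that check passes. The sketch assumes hypothetical latency samples already grouped into windows and uses a nearest-rank percentile, which is only one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def sli_met(window_latencies_ms, threshold_ms=300, pct=95):
    """SLI check for one window: the chosen percentile latency is below the threshold."""
    return percentile(window_latencies_ms, pct) < threshold_ms


def slo_compliance(windows, **kwargs):
    """Fraction of windows in which the SLI was met."""
    return sum(1 for w in windows if sli_met(w, **kwargs)) / len(windows)


# Hypothetical data: each inner list holds the latencies (ms) seen in one 5-minute window.
windows = [
    [120, 180, 150, 210, 290],
    [100, 110, 130, 140, 160],
    [200, 250, 310, 280, 450],  # slowest request is 450 ms, so this window fails
    [90, 95, 105, 115, 125],
]

compliance = slo_compliance(windows)
print(f"SLI met in {compliance:.1%} of windows; 99.9% SLO met: {compliance >= 0.999}")
```

With only four sample windows a single failure drops compliance far below the 99.9% target. Over a real year of 5-minute windows the same SLO tolerates roughly one failing window in a thousand.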

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases and gives SREs a great opportunity to:

  • Measure and track SLIs (Measuring increasing performance)
  • Evaluate reliability
  • Define SLOs
  • Build capacity models
  • Establish incident response, shared with dev team

General Availability Phase

  • After Production Readiness Review passed
  • SREs handle the majority of ops work
  • Incident responses
  • Track operational load and SLOs

Ops & SRE: Key Takeaways

  • SRE: Operations as done by a software engineer
  • Many shared pillars between DevOps/SRE
  • SLIs are quantitative metrics over time
  • Remember the 4x Google Golden Signals (LETS)
  • SLOs are a target objective for reliability
  • SLIs are lower than the SLO - or - in between the upper and lower bound
  • SREs are most active in limited availability and general availability phases

Operation Services at a Glance

Section Review

Milestone: The Weight of the World (Teamwork, Not Superheroes)

Monitoring Your Operations

Section Introduction

Cloud Monitoring Concepts

Monitoring Workspaces Concepts

Monitoring Workspaces

Perspective: Workspaces in Context

What Are Metrics?

Exploring Workspace and Metrics

Monitoring Agent Concepts

Installing the Monitoring Agent

Collecting Monitoring Agent Metrics

Integration with Monitoring API

Create Dashboards with Command Line

GKE Metrics

Perspective: What's Up, Doc?

Uptime Checks

Establishing Human-Actionable and Automated Alerts

Section Review

Milestone: Spies Everywhere! (Check Those Vitals!)

Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

Logging Activities

Section Introduction

Cloud Logging Fundamentals

Log Types and Mechanics

Cloud Logging Tour

Logging Agent Concepts

Install Logging Agent and Collect Agent Logs

Logging Filters

Hands-On with Advanced Filters

VPC Flow Logs

Firewall Logs

VPC Flow Logs and Firewall Logs Demo

Routing and Exporting Logs

Export Logs to BigQuery

Logs-Based Metrics

Section Review

Milestone: Let the Record Show

Hands-On Lab: Install and Configure Logging Agent on Google Cloud

SRE and Alerting Policies

SLOs and Alerting Strategy

Service Monitoring

Milestone: Come Together, Right Now, SRE

Optimize Performance with Trace/Profiler

Section Introduction

What the Services Do and Why They Matter

Tracking Latency with Cloud Trace

Accessing the Cloud Trace APIs

Setting Up Your App with Cloud Profiler

Analyzing Cloud Profiler Data

Section Review

Milestone: It All Adds Up!

Hands-On Lab: Discovering Latency with Google Cloud Trace

Identifying Application Errors with Debug/Error Reporting

Section Introduction

Troubleshooting with Cloud Debugger

Establishing Error Reporting for Your App

Managing Errors and Handling Notifications

Section Review

Milestone: Come Together - Reprise (Debug Is De Solution)

Hands-On Lab: Correcting Code with Cloud Debugger

Course Conclusion

Milestone: Are We There, Yet?

Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep