Started first section of Course 4

2021-02-03 16:54:24 +00:00 · 2021-02-03 16:54:24 +00:00 · c51d10402b
commit c51d10402b
parent c8a0bcabef
1 changed files with 281 additions and 0 deletions
--- a/Part_4.md
+++ b/Part_4.md
@@ -0,0 +1,281 @@
 Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
 ===============================================================================================
 ### Introduction 
 #### About the Course and Learning Path
 Make better software, faster
 #### Milestone: Getting Started
 ### Understanding Operations in Context
 #### Section Introduction
 #### What Is Ops?
 - GCP Defined
  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_
  - Logging Management  
    - Gather Logs, metrics and traces everywhere
      - Audit, Platform, User logs
        - Export, Discard, Ingest
  - Error Reporting
    - So much data: how do you pick out the important indicators?
      - A centralized error management interface that shows current & past errors
      - Identify your app's top or new errors at a glance, in a dedicated dashboard
      - Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, and a cleaned exception stack trace
  - Across-the-board Monitoring
    - Dashboards for built-in and customizable visualisations
      - Monitoring Features
        - Visual Dashboards
    - Health Monitoring
      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
    - Service Monitoring
      - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
  - SRE Tracking
    - Monitoring is critical for SRE
      - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
        - Pinpoint SLIs and develop an SLO on top of them
  - Operational Management
    - Debugging
      - Inspects the state of your application at any code location in production without stopping or slowing down requests
    - Latency Management
      - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
    - Performance Management
      - Offers continuous profiling of resource consumption in your production applications along with cost management
    - Security Management
      - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud
 **What is Ops: Key Takeaways**
 - Ops Defined: Watch, learn and fix
 - Primary services: Monitoring and Logging
 - Monitoring dashboards for all metrics, including health and services (SLOs)
 - Logs can be exported, discarded, or ingested
 - SRE depends on ops
 - Error alerting pinpoints problems, quickly
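
The "exported, discarded, or ingested" takeaway can be pictured as a routing decision made per log entry. The sketch below is plain Python, not the Cloud Logging API: the `LogEntry` fields, the rules inside `route`, and the one-outcome-per-entry simplification are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    EXPORT = "export"
    DISCARD = "discard"
    INGEST = "ingest"


@dataclass
class LogEntry:
    log_type: str   # hypothetical values: "audit", "platform", "user"
    severity: str   # hypothetical values: "DEBUG", "INFO", "ERROR"
    message: str


def route(entry: LogEntry) -> Outcome:
    """Pick exactly one outcome per entry (a simplification for illustration)."""
    if entry.log_type == "audit":
        return Outcome.EXPORT    # e.g. send audit logs to long-term storage
    if entry.severity == "DEBUG":
        return Outcome.DISCARD   # drop noisy, low-value entries
    return Outcome.INGEST        # everything else stays searchable


entries = [
    LogEntry("audit", "INFO", "admin changed IAM policy"),
    LogEntry("user", "DEBUG", "cache miss for key abc"),
    LogEntry("platform", "ERROR", "instance health check failed"),
]
outcomes = [route(e) for e in entries]
print([o.value for o in outcomes])
```

The rule order matters here: audit entries are exported even when their severity is `DEBUG`, because the audit check runs first.
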
 Scratch:
 - Metric query and tracing analysis
 - Establish performance and reliability indicators
 - Trigger alerts and error reporting
 - Logging Features
 - Error Reporting
 - SRE Tracking (SLI/SLO)
 - Performance Management
 #### Clarifying the Stackdriver/Operations Connection
 - 2012 - Stackdriver Created
 - 2014 - Stackdriver Acquired by Google
 - 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
 - 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into GCP and drops the name
 **Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver `<service>`)
 "Stackdriver" lives on - in the exam only
 Integration + Upgrades
 - Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
 - Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
 - Log Retention Increased
  - Logs can now be retained for up to **10 years** and you have control over the time specified
 - Metrics Enhancement
  - In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
 - Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub
 #### Operations and SRE: How Do They Relate?
 - Lots of questions in Exam on SRE
 What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)
 **Pillars of DevOps**
 - Accept failure as normal:
  - Try to anticipate, but...
  - Incidents bound to occur
  - Failures help team learn
 - No-fault postmortems & SLOs:
  - No two failures the same
  - Track incidents (SLIs)
  - Map to Objectives (SLOs)
 - Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to rollback
 - Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible
 - Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring
 - Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time
 <hr style="height:2px;border-width:0;color:gray;background-color:gray">
 SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_
 SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing
 Example SLIs:
 - Request Latency: How long it takes to return a response to a request
 - Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
 - Batch Throughput: Proportion of time the data processing rate is greater than a threshold
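
The three example SLIs above can be computed from raw samples roughly as follows. This is a minimal sketch: the `Request` record, the function names, and the sample numbers are hypothetical, and mean latency is used only as one possible way to summarize "how long it takes".

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    success: bool


def mean_latency_ms(requests):
    """Request latency: average time to return a response."""
    return sum(r.latency_ms for r in requests) / len(requests)


def failure_rate(requests):
    """Failure rate: unsuccessful requests / all requests."""
    failed = sum(1 for r in requests if not r.success)
    return failed / len(requests)


def throughput_sli(rates, threshold):
    """Batch throughput: proportion of samples whose rate exceeds the threshold."""
    return sum(1 for rate in rates if rate > threshold) / len(rates)


requests = [
    Request(120, True),
    Request(80, True),
    Request(400, False),
    Request(200, True),
]
rates_mb_per_s = [50, 45, 20, 60, 55]  # one sample per time slice

print(mean_latency_ms(requests))           # 200.0
print(failure_rate(requests))              # 0.25
print(throughput_sli(rates_mb_per_s, 40))  # 0.8
```
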
 **Commit to Memory - Google's 4x Golden Signals!**
 - Latency
  - The time it takes for your service to fulfill a request
 - Errors    
  - The rate at which your service fails
 - Traffic
  - How much demand is directed at your service
 - Saturation
  - A measure of how close to fully utilized the service's resources are
 > **LETS**
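
As a memory aid, the four signals can be derived from one observation window. The sketch below assumes a hypothetical `golden_signals` helper, uses the median as the latency summary, and uses CPU usage as the saturation resource; none of these choices come from the course.

```python
from statistics import median


def golden_signals(latencies_ms, error_count, window_seconds, cpu_used, cpu_capacity):
    """Summarize one observation window as the four golden signals (LETS)."""
    total_requests = len(latencies_ms)
    return {
        "latency_ms": median(latencies_ms),              # Latency
        "error_rate": error_count / total_requests,      # Errors
        "traffic_rps": total_requests / window_seconds,  # Traffic
        "saturation": cpu_used / cpu_capacity,           # Saturation
    }


signals = golden_signals(
    latencies_ms=[100, 120, 150, 300, 90],
    error_count=1,
    window_seconds=10,
    cpu_used=3.0,
    cpu_capacity=4.0,
)
print(signals)
```
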
 <hr style="height:2px;border-width:0;color:gray;background-color:gray">
 SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook
 SLOs are tied to your SLIs
 - Measured by SLIs
 - Can be a single target value or range of values
 - SLIs <= SLO
 - or
 - (lower bound <= SLI <= upper bound) = SLO
 - Common SLOs: 99.5%, 99.9%, 99.99% (four 9s)
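
The two SLO shapes above (single target or range) and the common percentages can be made concrete with a little arithmetic. This is a sketch under assumptions: `meets_slo` and `allowed_downtime_minutes` are hypothetical helpers, and the downtime figures treat each percentage as an availability target over a 30-day window, which the notes do not specify.

```python
def meets_slo(sli, upper, lower=None):
    """True if sli <= upper, or lower <= sli <= upper when a lower bound is given."""
    if lower is None:
        return sli <= upper
    return lower <= sli <= upper


def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of unavailability an availability target permits over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


print(meets_slo(250, upper=300))              # True: latency SLI under its target
print(meets_slo(0.5, upper=0.9, lower=0.2))   # True: SLI inside the range

for target in (99.5, 99.9, 99.99):
    print(target, round(allowed_downtime_minutes(target), 2))
# 99.5 216.0
# 99.9 43.2
# 99.99 4.32
```

Each extra 9 cuts the permitted downtime by roughly a factor of ten, which is why the higher targets are much harder to hold.
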
 SLI - Metrics over time that detail the health of a service
  - example: `Site homepage request latency < 300ms over the last 5 minutes @ 95th percentile`
 SLO - Agreed-upon bounds on how often SLIs must be met
  - example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
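
The homepage example can be evaluated in two steps: check the SLI per 5-minute window, then compare the share of good windows against the SLO. The sketch below assumes a nearest-rank percentile and made-up latency samples; the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sli_met(window_latencies_ms, threshold_ms=300, pct=95):
    """SLI for one window: is the pct-th percentile latency under the threshold?"""
    return percentile(window_latencies_ms, pct) < threshold_ms


def slo_compliance(windows):
    """Fraction of windows in which the SLI was met."""
    good = sum(1 for w in windows if sli_met(w))
    return good / len(windows)


# Each inner list stands for the latencies seen in one 5-minute window.
windows = [
    [100] * 19 + [900],       # one slow outlier: 95th percentile is still 100
    [100] * 18 + [900, 900],  # two slow requests: 95th percentile is 900
    [250] * 20,
    [120] * 20,
]

compliance = slo_compliance(windows)
print(compliance)           # 0.75
print(compliance >= 0.999)  # False: the 99.9% SLO is missed
```

The first window shows why a percentile is used instead of the maximum: a single outlier does not fail the SLI.
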
 Phases of Service Lifetime
 SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
 - Measure and track SLIs (Measuring increasing performance)
 - Evaluate reliability
 - Define SLOs
 - Build capacity models
 - Establish incident response, shared with dev team
 General Availability Phase
 - After Production Readiness Review passed
 - SREs handle majority of op work
 - Incident responses
 - Track operational load and SLOs
 **Ops & SRE: Key Takeaways**
 - SRE: Operations as practiced by a software engineer
 - Many shared pillars between DevOps/SRE
 - SLIs are quantitative metrics over time
 - Remember the 4x Google Golden Signals (LETS)
 - SLOs are a target objective for reliability
 - SLIs are lower than the SLO - or - in between an upper and lower bound
 - SREs are most active in limited availability and general availability phases
 #### Operation Services at a Glance
 #### Section Review
 #### Milestone: The Weight of the World (Teamwork, Not Superheroes)
 Monitoring Your Operations
 Section Introduction
 Cloud Monitoring Concepts
 Monitoring Workspaces Concepts
 Monitoring Workspaces
 Perspective: Workspaces in Context
 What Are Metrics?
 Exploring Workspace and Metrics
 Monitoring Agent Concepts
 Installing the Monitoring Agent
 Collecting Monitoring Agent Metrics
 Integration with Monitoring API
 Create Dashboards with Command Line
 GKE Metrics
 Perspective: What's Up, Doc?
 Uptime Checks
 Establishing Human-Actionable and Automated Alerts
 Section Review
 Milestone: Spies Everywhere! (Check Those Vitals!)
 Hands-On Lab:
 Install and Configure Monitoring Agent with Google Cloud Monitoring
 Logging Activities
 Section Introduction
 Cloud Logging Fundamentals
 Log Types and Mechanics
 Cloud Logging Tour
 Logging Agent Concepts
 Install Logging Agent and Collect Agent Logs
 Logging Filters
 Hands-On with Advanced Filters
 VPC Flow Logs
 Firewall Logs
 VPC Flow Logs and Firewall Logs Demo
 Routing and Exporting Logs
 Export Logs to BigQuery
 Logs-Based Metrics
 Section Review
 Milestone: Let the Record Show
 Hands-On Lab:
 Install and Configure Logging Agent on Google Cloud
 SRE and Alerting Policies
 SLOs and Alerting Strategy
 Service Monitoring
 Milestone: Come Together, Right Now, SRE
 Optimize Performance with Trace/Profiler
 Section Introduction
 What the Services Do and Why They Matter
 Tracking Latency with Cloud Trace
 Accessing the Cloud Trace APIs
 Setting Up Your App with Cloud Profiler
 Analyzing Cloud Profiler Data
 Section Review
 Milestone: It All Adds Up!
 Hands-On Lab:
 Discovering Latency with Google Cloud Trace
 Identifying Application Errors with Debug/Error Reporting
 Section Introduction
 Troubleshooting with Cloud Debugger
 Establishing Error Reporting for Your App
 Managing Errors and Handling Notifications
 Section Review
 Milestone: Come Together - Reprise (Debug Is De Solution)
 Hands-On Lab:
 Correcting Code with Cloud Debugger
 Course Conclusion
 Milestone: Are We There, Yet?
 landscape
 Practice Exam / Quiz:
 Google Certified Professional Cloud DevOps Engineer Exam Prep